
Titanic Survival Prediction

By 
Vatsala Gupta B190150EC
Sameer Chettri B190063EC
Simran Dutta B190066EC


The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the deaths of 1502 of the 2224 passengers and crew. While some element of luck was involved in surviving, some groups of people seem to have been more likely to survive than others. In this challenge, we build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Content of the Code

1. Loading Data
2. Analysing the Data
3. Feature Engineering
4. KNN ML Model
5. Logistic Regression ML Model
6. Prediction

CODE

#Importing basic libraries and loading data sets

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline

# To manage ticker in plot
from matplotlib.ticker import MaxNLocator

Here we import the core libraries: pandas to create dataframes, numpy for mathematical operations on arrays, and seaborn & matplotlib for plotting the data.

train_data=pd.read_csv('/Users/sam/Documents/ML/train.csv')
train_data.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Here we load the training data file and preview the first five rows.

print('number of passengers in training data: ' + str(len(train_data.index)))
number of passengers in training data: 891
test_data=pd.read_csv('/Users/sam/Documents/ML/test.csv')
print('number of passengers in testing data: ' + str(len(test_data.index)))
number of passengers in testing data: 418

#Analyzing Data

# Let pandas show all columns
pd.options.display.width = 0

# For better visualisation, increase vertical space between subplots
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)

# Seaborn theme settings
sns.set_theme(style = "whitegrid", palette="deep")
<Figure size 432x288 with 0 Axes>

Now we concatenate the training and test data so that missing values can be imputed and feature relationships analysed on the full dataset.

#merging training and testing data for missing values imputation
len_train=len(train_data)
data_all=pd.concat(objs=[train_data,test_data],axis=0).reset_index(drop=True)
data_all.info()
print('number of passengers in all data: ' + str(len(data_all.index)))
data_all.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
number of passengers in all data: 1309
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We check how much data is missing in each feature, so we know what must be imputed before those features can be used for prediction.

#check missing data
null_data=data_all.isnull().sum()
null_data[null_data>0]
Survived     418
Age          263
Fare           1
Cabin       1014
Embarked       2
dtype: int64

We now convert some columns to pandas categories and replace the labels with discrete numeric codes. Note that cat.codes encodes missing values as -1, which is why the missing Embarked and Cabin_Group entries appear as -1 below.

#Converting some columns to categories and category labels to discrete numbers
data_all["Cabin_Group"] = data_all["Cabin"].str[:1]
data_all["Cabin_Group"] = data_all["Cabin_Group"].astype('category')
cabin_group_categories = dict(enumerate(data_all["Cabin_Group"].cat.categories))

data_all["Cabin_Group"] = data_all["Cabin_Group"].cat.codes
data_all["Cabin_Group"] = data_all["Cabin_Group"].astype(int)

data_all["Sex"] = data_all["Sex"].astype('category')
sex_categories = dict(enumerate(data_all["Sex"].cat.categories))
data_all["Sex"] = data_all["Sex"].cat.codes
data_all["Sex"] = data_all["Sex"].astype(int)

data_all["Embarked"] = data_all["Embarked"].astype('category')
embarked_categories = dict(enumerate(data_all["Embarked"].cat.categories))
data_all["Embarked"] = data_all["Embarked"].cat.codes
data_all["Embarked"] = data_all["Embarked"].astype(int)

data_all = data_all.convert_dtypes()
data_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   Int64  
 1   Survived     891 non-null    Int64  
 2   Pclass       1309 non-null   Int64  
 3   Name         1309 non-null   string 
 4   Sex          1309 non-null   Int64  
 5   Age          1046 non-null   Float64
 6   SibSp        1309 non-null   Int64  
 7   Parch        1309 non-null   Int64  
 8   Ticket       1309 non-null   string 
 9   Fare         1308 non-null   Float64
 10  Cabin        295 non-null    string 
 11  Embarked     1309 non-null   Int64  
 12  Cabin_Group  1309 non-null   Int64  
dtypes: Float64(2), Int64(8), string(3)
memory usage: 145.9 KB
print('\nEmbarked:')
print(embarked_categories)
print('\nCabin Group:')
print(cabin_group_categories)
print('\nSex:')
print(sex_categories)
Embarked:
{0: 'C', 1: 'Q', 2: 'S'}

Cabin Group:
{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'T'}

Sex:
{0: 'female', 1: 'male'}

Plots and imputing missing data

#fare
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Fare")
sns.scatterplot(ax=axes[0,1], data=data_all, x="Fare", y="Survived")
sns.scatterplot(ax=axes[0,2], data=data_all, x="Fare", y="Pclass")
sns.scatterplot(ax=axes[1,0], data=data_all, x="Fare", y="Age")
sns.scatterplot(ax=axes[1,1], data=data_all, x="Fare", y="Sex")
sns.scatterplot(ax=axes[1,2], data=data_all, x="Fare", y="SibSp")
sns.scatterplot(ax=axes[2,0], data=data_all, x="Fare", y="Parch")
sns.scatterplot(ax=axes[2,1], data=data_all, x="Fare", y="Cabin_Group")
sns.scatterplot(ax=axes[2,2], data=data_all, x="Fare", y="Embarked")

plt.show()

(figure: Fare histogram and Fare vs. the other features)

#checking the one missing fare data
data_all[data_all["Fare"].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Cabin_Group
1043 1044 <NA> 3 Storey, Mr. Thomas 1 60.5 0 0 3701 <NA> <NA> 2 -1
#filling the one missing fare data
m = data_all[data_all["Pclass"]==3]["Fare"].median()
data_all["Fare"] = data_all["Fare"].fillna(m)
#embarked 
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

# Make ticks integer for discrete values
axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[1,0].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Embarked")
sns.boxplot(ax=axes[0,1], data=data_all, x="Embarked",y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="Embarked",y="Pclass")
sns.boxplot(ax=axes[1,0], data=data_all, x="Embarked",y="Sex")
sns.boxplot(ax=axes[1,1], data=data_all, x="Embarked",y="Age")
sns.boxplot(ax=axes[1,2], data=data_all, x="Embarked",y="SibSp")
sns.boxplot(ax=axes[2,0], data=data_all, x="Embarked",y="Parch")
sns.boxplot(ax=axes[2,1], data=data_all, x="Embarked",y="Fare")
sns.boxplot(ax=axes[2,2], data=data_all, x="Embarked",y="Cabin_Group")

plt.show()

(figure: Embarked histogram and Embarked vs. the other features)

#checking missing embarked values
data_all[data_all["Embarked"]==-1]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Cabin_Group
61 62 1 1 Icard, Miss. Amelie 0 38.0 0 0 113572 80.0 B28 -1 1
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) 0 62.0 0 0 113572 80.0 B28 -1 1

The Fare column suggests Embarked is "C", and the Pclass column confirms it. We also see that cabin names starting with B belong to port 0 (= C), so we can fill the missing Embarked values with 0.

The most frequent embarkation point is 2 (= S).

Pclass at embarkation point 1 (= Q) is mostly 3, with some outliers. Embarkation point 2 (= S) majors on Pclass 2 and 3.

People at embarkation point 1 (= Q) are somewhat younger than the others, and most of them probably do not have family aboard.

Embarkation point 0 (= C) mostly consists of families with children; there are children in every Pclass.

People at embarkation point 0 (= C) pay more for tickets.

Cabin_Group 0 (= A) and 1 (= B) mostly use embarkation point 0 (= C).
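
As a quick, optional sanity check (a sketch, not part of the original notebook), the median first-class fare per embarkation code can be compared with the 80.0 that the two passengers with missing Embarked values paid:

# Sketch: median first-class Fare per embarkation code (-1 = missing, 0 = C, 1 = Q, 2 = S)
print(data_all[data_all["Pclass"] == 1].groupby("Embarked")["Fare"].median())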

#imputing missing embarked values
data_all["Embarked"] = data_all["Embarked"].replace(-1,0)
#cabin group
fig,axes = plt.subplots(nrows=3, ncols=3,figsize=(15,15))

axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[1,0].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Cabin_Group")
sns.boxplot(ax=axes[0,1], data=data_all, x="Cabin_Group", y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="Cabin_Group", y="Pclass")
sns.boxplot(ax=axes[1,0], data=data_all, x="Cabin_Group", y="Sex")
sns.boxplot(ax=axes[1,1], data=data_all, x="Cabin_Group", y="Age")
sns.boxplot(ax=axes[1,2], data=data_all, x="Cabin_Group", y="SibSp")
sns.boxplot(ax=axes[2,0], data=data_all, x="Cabin_Group", y="Parch")
sns.boxplot(ax=axes[2,1], data=data_all, x="Cabin_Group", y="Fare")
sns.boxplot(ax=axes[2,2], data=data_all, x="Cabin_Group", y="Embarked")

plt.show()

(figure: Cabin_Group histogram and Cabin_Group vs. the other features)

A large share of the Cabin_Group data is missing.

Leaving the missing data aside: Cabin_Group 1 (= B), 3 (= D) and 4 (= E) mostly survived, while Cabin_Group 7 (= T) did not.

Missing Cabin_Group data mostly belongs to Pclass 2 and 3. Cabin_Group 5 (= F) spans Pclass 2 and 3, while Cabin_Group 6 (= G) is entirely Pclass 3.

People in Cabin_Group 0 (= A) and 7 (= T) are male; Cabin_Group 6 (= G) is female.

The age range is mostly 20-35 where Cabin_Group data is missing.

Cabin_Group 0 (= A) consists of older people; the youngest are in Cabin_Group 5 (= F) and 6 (= G).

Except for Cabin_Group 0 (= A) and 7 (= T), SibSp is not descriptive of Cabin_Group, and the same holds for Parch.

People in Cabin_Group 2 (= C) pay more for tickets.

Missing Cabin_Group data majors on embarkation point 2 (= S).

#imputing missing values in cabin group
missing_survived = len(data_all[(data_all['Cabin_Group']==-1) &(data_all["Survived"]==1)])
missing_not_survived = len(data_all[(data_all['Cabin_Group']==-1) &(data_all["Survived"]==0)])

cabin_null_count = len(data_all[(data_all['Cabin_Group']==-1)])
                               
print("Survived Percentage in Missing Cabin Values : ", '{:.0%}'.format(missing_survived / cabin_null_count))
print("Not-Survived Percentage in Missing Cabin Values : ", '{:.0%}'.format(missing_not_survived/ cabin_null_count))
Survived Percentage in Missing Cabin Values :  20%
Not-Survived Percentage in Missing Cabin Values :  47%
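
For context, a similar check (a sketch under the same encoding, where -1 marks a missing cabin) gives the split among passengers whose Cabin is known:

# Sketch: survival split where Cabin is known (rows from the test set have no Survived value)
known = data_all[data_all['Cabin_Group'] != -1]
known_survived = len(known[known["Survived"] == 1])
known_not_survived = len(known[known["Survived"] == 0])
print("Survived Percentage in Known Cabin Values : ", '{:.0%}'.format(known_survived / len(known)))
print("Not-Survived Percentage in Known Cabin Values : ", '{:.0%}'.format(known_not_survived / len(known)))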

Since Cabin_Group separates survivors from non-survivors reasonably well even through its missing values, we keep Cabin_Group and drop the raw Cabin column. (The percentages above do not sum to 100% because the remaining missing-cabin rows come from the test set, where Survived itself is unknown.)

data_all = data_all.drop(['Cabin'], axis=1)
#age
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

sns.histplot(ax=axes[0,0], data=data_all, x="Age",bins=10)
sns.boxplot(ax=axes[0,1], data=data_all, x="Pclass", y="Age")
sns.boxplot(ax=axes[0,2], data=data_all, x="Sex", y="Age")
sns.boxplot(ax=axes[1,0], data=data_all, x="SibSp", y="Age")
sns.boxplot(ax=axes[1,1], data=data_all, x="Parch", y="Age")
sns.scatterplot(ax=axes[1,2], data=data_all, x="Fare", y="Age")
sns.boxplot(ax=axes[2,0], data=data_all, x="Cabin_Group", y="Age")
sns.boxplot(ax=axes[2,1], data=data_all, x="Embarked", y="Age")
fig.delaxes(axes[2,2])

plt.show()

(figure: Age histogram and Age vs. the other features)

In light of the above inspection, we use the Pclass, SibSp, Parch, Cabin_Group, Survived and Embarked features to impute the missing Age values with a decision tree (via sklearn's IterativeImputer).

#imputing missing age
from sklearn.tree import DecisionTreeRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

data_impute_dtree = data_all.copy()
data_impute_dtree = data_impute_dtree.drop(["Name", "PassengerId","Ticket", "Sex","Fare"], axis=1)

dtr = DecisionTreeRegressor()
imp = IterativeImputer(estimator=dtr, missing_values=np.nan, max_iter=500, verbose=0, imputation_order='roman', random_state=42, min_value=0)
x_imputed = imp.fit_transform(data_impute_dtree)

data_impute_dtree["MF_Age"] = x_imputed[:,2]
data_impute_dtree["MF_Age"] = data_impute_dtree["MF_Age"].astype(int)
#Pclass
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Pclass")
sns.boxplot(ax=axes[0,1], data=data_all, x="Pclass", y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="Pclass", y="Sex")
sns.boxplot(ax=axes[1,0], data=data_all, x="Pclass", y="Age")
sns.boxplot(ax=axes[1,1], data=data_all, x="Pclass", y="SibSp")
sns.boxplot(ax=axes[1,2], data=data_all, x="Pclass", y="Parch")
sns.boxplot(ax=axes[2,0], data=data_all, x="Pclass", y="Fare")
sns.boxplot(ax=axes[2,1], data=data_all, x="Pclass", y="Cabin_Group")
sns.boxplot(ax=axes[2,2], data=data_all, x="Pclass", y="Embarked")

plt.show()

(figure: Pclass histogram and Pclass vs. the other features)

Pclass 3 mostly did not survive.

Pclass 1 consists of older people; the younger people are in Pclass 3.

Families are mostly in Pclass 2.

Pclass 1 pays more for tickets.

#sex vs others
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Sex")
sns.boxplot(ax=axes[0,1], data=data_all, x="Sex", y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="Sex", y="Pclass")
sns.boxplot(ax=axes[1,0], data=data_all, x="Sex", y="Age")
sns.boxplot(ax=axes[1,1], data=data_all, x="Sex", y="SibSp")
sns.boxplot(ax=axes[1,2], data=data_all, x="Sex", y="Parch")
sns.boxplot(ax=axes[2,0], data=data_all, x="Sex", y="Fare")
sns.boxplot(ax=axes[2,1], data=data_all, x="Sex", y="Cabin_Group")
sns.boxplot(ax=axes[2,2], data=data_all, x="Sex", y="Embarked")

plt.show()

(figure: Sex histogram and Sex vs. the other features)

The number of men is roughly twice the number of women.

Females mostly survived.

Males are concentrated in Pclass 2 and 3.

Females are slightly younger.

Females are more likely to have siblings/spouses/children aboard.

Females pay more for tickets.

Females mostly embarked at 1 (= Q) and 2 (= S).

#SibSp(siblings or spouse on board)
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[1,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="SibSp")
sns.boxplot(ax=axes[0,1], data=data_all, x="SibSp", y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="SibSp", y="Pclass")
sns.boxplot(ax=axes[1,0], data=data_all, x="SibSp", y="Age")
sns.boxplot(ax=axes[1,1], data=data_all, x="SibSp", y="Sex")
sns.boxplot(ax=axes[1,2], data=data_all, x="SibSp", y="Parch")
sns.boxplot(ax=axes[2,0], data=data_all, x="SibSp", y="Fare")
sns.boxplot(ax=axes[2,1], data=data_all, x="SibSp", y="Cabin_Group")
sns.boxplot(ax=axes[2,2], data=data_all, x="SibSp", y="Embarked")

plt.show()

(figure: SibSp histogram and SibSp vs. the other features)

Most people have no sibling/spouse aboard; those who do usually have one.

More than 2 SibSp dramatically decreases the chance of survival.

There are more survivors in Pclass 1 (middle-aged, embarked at 1 (= Q)).

More than 2 siblings/spouses implies a young passenger (the more siblings, the lower the age).

#Parch(parents or children on board)
fig,axes = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

axes[0,0].xaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[1,1].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[0,2].yaxis.set_major_locator(MaxNLocator(integer=True))
axes[2,2].yaxis.set_major_locator(MaxNLocator(integer=True))

sns.histplot(ax=axes[0,0], data=data_all, x="Parch")
sns.boxplot(ax=axes[0,1], data=data_all, x="Parch", y="Survived")
sns.boxplot(ax=axes[0,2], data=data_all, x="Parch", y="Pclass")
sns.boxplot(ax=axes[1,0], data=data_all, x="Parch", y="Age")
sns.boxplot(ax=axes[1,1], data=data_all, x="Parch", y="Sex")
sns.boxplot(ax=axes[1,2], data=data_all, x="Parch", y="SibSp")
sns.boxplot(ax=axes[2,0], data=data_all, x="Parch", y="Fare")
sns.boxplot(ax=axes[2,1], data=data_all, x="Parch", y="Cabin_Group")
sns.boxplot(ax=axes[2,2], data=data_all, x="Parch", y="Embarked")

plt.show()

(figure: Parch histogram and Parch vs. the other features)

Parch has some relation with Age and shows a few clear patterns; in particular, more parents/children aboard means paying more for tickets.

data_final = data_impute_dtree.copy()
data_final["Age"] = data_final["MF_Age"]
data_final = data_final.drop("MF_Age", axis=1)
data_final["Name"] = data_all["Name"]
data_final["Sex"] = data_all["Sex"]
data_final["Fare"] = data_all["Fare"]

#Feature Engineering

Pclass:

#Pclass
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=60)
axes[1].tick_params('x', labelrotation=60)

sns.countplot(ax=axes[0], x="Pclass", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Pclass", y="Survived")

plt.show()

(figure: Pclass counts and survival by Pclass)

Pclass is a good feature to predict survival.

Name:

data_final["Name"].head()
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: string
data_final["Title"] = [i.split(".")[0].split(",")[-1].strip() for i in data_final["Name"]]
#plotting titles in the name
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=60)
axes[1].tick_params('x', labelrotation=60)

sns.countplot(ax=axes[0], x="Title", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Title", y="Survived")

plt.show()

(figure: Title counts and survival by Title)

data_final["Title"].unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer', 'Dona'], dtype=object)
data_final["Title"] = data_final["Title"].replace(["Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"], "Rare_Male")
data_final["Title"] = data_final["Title"].replace(["Lady", "the Countess", "Dona", "Mme", "Ms", "Mlle"], "Rare_Female")

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.countplot(ax=axes[0], x="Title", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Title", y="Survived")

plt.show()

(figure: grouped Title counts and survival by Title)

The title extracted from the name is a good predictor of survival, so we use it.

#converting titles to categories and discrete values
data_final["Title"] = data_final["Title"].astype('category')
title_categories = dict(enumerate(data_final["Title"].cat.categories))
data_final["Title"] = data_final["Title"].cat.codes
data_final["Title"]  =data_final["Title"].astype(int)

data_final.drop(labels=["Name"], axis=1, inplace=True)
data_final["Embarked"] = data_final["Embarked"].astype('category')

title_categories
{0: 'Master', 1: 'Miss', 2: 'Mr', 3: 'Mrs', 4: 'Rare_Female', 5: 'Rare_Male'}

Sex:

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.countplot(ax=axes[0], x="Sex", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Sex", y="Survived")

plt.show()

(figure: Sex counts and survival by Sex)

Sex is a good feature to predict survival.

Age:

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(30,15))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=90)
axes[1].tick_params('x', labelrotation=90)

sns.countplot(ax=axes[0], x="Age", data = data_final)
sns.scatterplot(ax=axes[1], data=data_final, x="Age", y="Survived",s=70)

plt.show()

(figure: Age counts and survival by Age)

Age can be used to predict survival.

Parch + SibSp = Fmly_Count:

data_final["Fmly_Count"] = data_final["SibSp"] + data_final["Parch"] + 1

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.countplot(ax=axes[0], x="Fmly_Count", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Fmly_Count", y="Survived")

plt.show()

(figure: Fmly_Count counts and survival by Fmly_Count)

Fmly_Count can be used to predict survival.

Fare:

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.histplot(ax=axes[0], x="Fare", data=data_final)
sns.scatterplot(ax=axes[1], data=data_final,x ="Fare",y="Survived")

plt.show()

(figure: Fare histogram and survival vs. Fare)

Fare can be used to predict survival.

Cabin_Group:

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.countplot(ax=axes[0], x="Cabin_Group", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Cabin_Group", y="Survived")

plt.show()

(figure: Cabin_Group counts and survival by Cabin_Group)

Cabin_Group can be used for prediction.

Embarked:

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axes[1].yaxis.set_major_locator(MaxNLocator(integer=True))

axes[0].tick_params('x', labelrotation=45)
axes[1].tick_params('x', labelrotation=45)

sns.countplot(ax=axes[0], x="Embarked", data=data_final)
sns.boxplot(ax=axes[1], data=data_final, x="Embarked", y="Survived")

plt.show()

(figure: Embarked counts and survival by Embarked)

Embarked doesn't give us much meaningful insight for prediction.

#Training and testing data

#get dummies
data_final = pd.get_dummies(data_final, columns=["Pclass", "Title", "Sex", "Fmly_Count", "Cabin_Group", "Embarked"])
#dropping unnecessary
data_final.drop(labels=['SibSp','Parch'],axis=1,inplace=True)
data_final.head(5)
Survived Age Fare Pclass_1 Pclass_2 Pclass_3 Title_0 Title_1 Title_2 Title_3 ... Cabin_Group_1 Cabin_Group_2 Cabin_Group_3 Cabin_Group_4 Cabin_Group_5 Cabin_Group_6 Cabin_Group_7 Embarked_0 Embarked_1 Embarked_2
0 0 22 7.25 0 0 1 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
1 1 38 71.2833 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 1 0 0
2 1 26 7.925 0 0 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 1 35 53.1 1 0 0 0 0 0 1 ... 0 1 0 0 0 0 0 0 0 1
4 0 35 8.05 0 0 1 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 35 columns

Next we split the combined data back into the original training and test sets.

#splitting of testing and training data
train_data = data_final[:len_train]
test_data = data_final[len_train:]

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        891 non-null    Int64  
 1   Age             891 non-null    int64  
 2   Fare            891 non-null    Float64
 3   Pclass_1        891 non-null    uint8  
 4   Pclass_2        891 non-null    uint8  
 5   Pclass_3        891 non-null    uint8  
 6   Title_0         891 non-null    uint8  
 7   Title_1         891 non-null    uint8  
 8   Title_2         891 non-null    uint8  
 9   Title_3         891 non-null    uint8  
 10  Title_4         891 non-null    uint8  
 11  Title_5         891 non-null    uint8  
 12  Sex_0           891 non-null    uint8  
 13  Sex_1           891 non-null    uint8  
 14  Fmly_Count_1    891 non-null    uint8  
 15  Fmly_Count_2    891 non-null    uint8  
 16  Fmly_Count_3    891 non-null    uint8  
 17  Fmly_Count_4    891 non-null    uint8  
 18  Fmly_Count_5    891 non-null    uint8  
 19  Fmly_Count_6    891 non-null    uint8  
 20  Fmly_Count_7    891 non-null    uint8  
 21  Fmly_Count_8    891 non-null    uint8  
 22  Fmly_Count_11   891 non-null    uint8  
 23  Cabin_Group_-1  891 non-null    uint8  
 24  Cabin_Group_0   891 non-null    uint8  
 25  Cabin_Group_1   891 non-null    uint8  
 26  Cabin_Group_2   891 non-null    uint8  
 27  Cabin_Group_3   891 non-null    uint8  
 28  Cabin_Group_4   891 non-null    uint8  
 29  Cabin_Group_5   891 non-null    uint8  
 30  Cabin_Group_6   891 non-null    uint8  
 31  Cabin_Group_7   891 non-null    uint8  
 32  Embarked_0      891 non-null    uint8  
 33  Embarked_1      891 non-null    uint8  
 34  Embarked_2      891 non-null    uint8  
dtypes: Float64(1), Int64(1), int64(1), uint8(32)
memory usage: 50.6 KB

We use StandardScaler from sklearn to scale the data; scaling matters especially for KNN, which relies on distances between samples.

from sklearn.preprocessing import StandardScaler

x_train = train_data.drop('Survived', axis=1)
x_train = StandardScaler().fit_transform(x_train)

y_train = train_data["Survived"].astype('int')
print(y_train)
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

Next we create a validation set from our training data so we can estimate the accuracy of the model. A validation set is a held-out part of the training data that is used like test data to measure accuracy.

from sklearn.model_selection import train_test_split

# x_train was already standardised above, so we split it directly
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.33, random_state=42)

First we apply the K-Nearest Neighbours (KNN) model to our input features and compute its accuracy on the validation set.

#Prediction using KNN

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

knn_mod = KNeighborsClassifier()
knn_mod.fit(x_train,y_train)

y_prediction_train=knn_mod.predict(x_train)
y_prediction_test=knn_mod.predict(x_test)

a = confusion_matrix(y_train, y_prediction_train)
print(a)
# sklearn orders the confusion matrix as [[tn, fp], [fn, tp]] for labels (0, 1)
tn, fp, fn, tp = a.reshape(-1)
print('Outcome values : \n', tn, fp, fn, tp)
# classification report for precision, recall, f1-score and accuracy
matrix = classification_report(y_train, y_prediction_train, labels=[1,0])
print('Classification report : \n', matrix)

b = confusion_matrix(y_test, y_prediction_test)
print(b)
tn, fp, fn, tp = b.reshape(-1)
print('Outcome values : \n', tn, fp, fn, tp)
matrix = classification_report(y_test, y_prediction_test, labels=[1,0])
print('Classification report : \n', matrix)
[[350  24]
 [ 57 165]]
Outcome values : 
 350 24 57 165
Classification report : 
               precision    recall  f1-score   support

           1       0.87      0.74      0.80       222
           0       0.86      0.94      0.90       374

    accuracy                           0.86       596
   macro avg       0.87      0.84      0.85       596
weighted avg       0.86      0.86      0.86       596

[[148  27]
 [ 39  81]]
Outcome values : 
 148 27 39 81
Classification report : 
               precision    recall  f1-score   support

           1       0.75      0.68      0.71       120
           0       0.79      0.85      0.82       175

    accuracy                           0.78       295
   macro avg       0.77      0.76      0.76       295
weighted avg       0.77      0.78      0.77       295
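
Note that KNeighborsClassifier was used with its default n_neighbors=5. A quick sweep (a sketch, not run in the original notebook) could check whether another k narrows the gap between training and validation accuracy:

# Sketch: training and validation accuracy of KNN for a few values of k
for k in [3, 5, 7, 9, 11, 15]:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    print('k =', k,
          'train:', round(model.score(x_train, y_train), 3),
          'validation:', round(model.score(x_test, y_test), 3))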

Here the training accuracy of the KNN model is 86% and the validation accuracy is 78%. Next we try another model, logistic regression, to check whether it performs better than KNN.

#Prediction using logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

accuracy_train = round(logreg.score(x_train, y_train) * 100, 2) 
accuracy_test = round(logreg.score(x_test, y_test) * 100, 2)

print("Training Accuracy: % {}".format(accuracy_train))
print("Testing Accuracy: % {}".format(accuracy_test))


y_prediction_train=logreg.predict(x_train)
y_prediction_test=logreg.predict(x_test)

a = confusion_matrix(y_train, y_prediction_train)
print(a)
# sklearn orders the confusion matrix as [[tn, fp], [fn, tp]] for labels (0, 1)
tn, fp, fn, tp = a.reshape(-1)
print('Outcome values : \n', tn, fp, fn, tp)
# classification report for precision, recall, f1-score and accuracy
matrix = classification_report(y_train, y_prediction_train, labels=[1,0])
print('Classification report : \n', matrix)

b = confusion_matrix(y_test, y_prediction_test)
print(b)
tn, fp, fn, tp = b.reshape(-1)
print('Outcome values : \n', tn, fp, fn, tp)
matrix = classification_report(y_test, y_prediction_test, labels=[1,0])
print('Classification report : \n', matrix)
Training Accuracy: 85.23%
Testing Accuracy: 83.73%
[[335  39]
 [ 49 173]]
Outcome values : 
 335 39 49 173
Classification report : 
               precision    recall  f1-score   support

           1       0.82      0.78      0.80       222
           0       0.87      0.90      0.88       374

    accuracy                           0.85       596
   macro avg       0.84      0.84      0.84       596
weighted avg       0.85      0.85      0.85       596

[[154  21]
 [ 27  93]]
Outcome values : 
 154 21 27 93
Classification report : 
               precision    recall  f1-score   support

           1       0.82      0.78      0.79       120
           0       0.85      0.88      0.87       175

    accuracy                           0.84       295
   macro avg       0.83      0.83      0.83       295
weighted avg       0.84      0.84      0.84       295

Now the training accuracy is 85% and the validation accuracy is 84%. The logistic regression model is clearly more accurate than the KNN model on held-out data, so we use it to predict the survival of the passengers in test.csv.
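
Since a single train/validation split can be noisy, a k-fold cross-validation comparison (a sketch, not part of the original run) would make this choice more robust:

# Sketch: 5-fold cross-validated accuracy for both models on the full training set
from sklearn.model_selection import cross_val_score

x_all = StandardScaler().fit_transform(train_data.drop('Survived', axis=1))
y_all = train_data["Survived"].astype('int')

for name, model in [("KNN", KNeighborsClassifier()), ("Logistic Regression", LogisticRegression())]:
    scores = cross_val_score(model, x_all, y_all, cv=5)
    print(name, 'mean accuracy:', round(scores.mean(), 3), '+/-', round(scores.std(), 3))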

#prediction by sklearn
x_test1 = test_data.drop("Survived", axis=1)
x_test1 = StandardScaler().fit_transform(x_test1)

y_predict = logreg.predict(x_test1)

# Reload the original test file to recover PassengerId, then attach the predictions.
# (Writing into the data_final slice directly would trigger pandas' SettingWithCopyWarning.)
test_data = pd.read_csv('/Users/sam/Documents/ML/test.csv')
test_data["Survived"] = y_predict.tolist()
test_data[["PassengerId","Survived"]].to_csv("submission1.csv", index=False)

We store the predicted output in submission1.csv, which contains each passenger id together with the predicted survival.

df=pd.read_csv("/Users/sam/Documents/ML/submission1.csv")
df.head(100)
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
... ... ...
95 987 0
96 988 1
97 989 0
98 990 1
99 991 0

100 rows × 2 columns
