# To add a new cell, type '#%%'
# To add a new markdown cell, type '#%% [markdown]'
#%%
from IPython import get_ipython
#%% [markdown]
#<h1> SIT 720 - Dimensionality Reduction </h1>
#%% [markdown]
# ### PCA
#%%
# import required modules
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
#%%
data = pd.read_csv('/Users/ragyibrahim/Documents/Uni/2019/Trimester 2/SIT720/AT1/data.csv', delimiter=',', header = None).values
print(data.shape)
#%% [markdown]
# #### Plot the data
#%%
plt.plot(data[:,0],data[:,1], '.', markersize=14)
plt.title('Original Data')
plt.show()
#%% [markdown]
# #### Normalise the data
#%%
mu = data.mean(axis=0) # means of each column
sigma = data.std(axis=0) #std of each column
print(mu)
print(sigma)
print("#####")
##Use mean and std to normalise data
Xnorm = ((data-mu)/sigma)
print(Xnorm[0:5,:])
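#%% [markdown]
# A quick sanity check (added here as a sketch, not part of the original worksheet): after normalisation, each column should have a mean of roughly 0 and a standard deviation of roughly 1.
#%%
print(Xnorm.mean(axis=0)) # expect values very close to 0
print(Xnorm.std(axis=0))  # expect values very close to 1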
#%% [markdown]
# #### Compute covariance matrix of normalised data
#%% [markdown]
# If m is the number of training data, calculate the covariance matrix as:
# $\sum_{}^{} =\frac{1}{m}Xnorm^TXnorm$
#%%
m = len(Xnorm)
covmatrix = np.dot(Xnorm.T, Xnorm)/m
print(covmatrix)
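#%% [markdown]
# As a cross-check (a sketch added for illustration), numpy's built-in **np.cov()** should give the same matrix when we use the biased estimator (bias=True divides by m rather than m-1) and treat rows as observations (rowvar=False).
#%%
print(np.cov(Xnorm, rowvar=False, bias=True)) # should match covmatrix above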
#%% [markdown]
# #### Compute Eigenvectors and Eigenvalues
#%% [markdown]
# Now, compute the eigenvalues (S) and eigenvectors (U) of this covariance matrix. The eigenvectors (U) become the principal components. We use the **linalg.eig()** function from *numpy* to compute the eigenvalues and eigenvectors of a square array.
#%%
# linalg.eig() returns two arrays: the first contains the eigenvalues and the second the corresponding eigenvectors (as columns). So with S, U = np.linalg.eig(covmatrix) we store the eigenvalues in S and the eigenvectors in U.
S, U = np.linalg.eig(covmatrix)
print('Eigenvalues: {}'.format(S))
print('Eigenvectors: ')
print(U)
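#%% [markdown]
# Note that **linalg.eig()** does not guarantee any particular ordering of the eigenvalues. For PCA we want them in descending order of variance, so the small sketch below (an addition for robustness; for this 2D example the order may already be correct) reorders S and the columns of U accordingly.
#%%
order = np.argsort(S)[::-1] # indices of the eigenvalues, largest first
S = S[order]                # eigenvalues in descending order
U = U[:, order]             # reorder the eigenvector columns to match
print(S)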
#%% [markdown]
# #### Project the data points onto the new principal axes (eigenvectors)
#%% [markdown]
# This is done by a simple linear transformation
#%%
Z = np.dot(Xnorm, U)
print(Z[0:5,:])
print('#############')
print(Xnorm[0:5,:])
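#%% [markdown]
# As a sanity check (a sketch added here), the covariance matrix of the projected data Z should be approximately diagonal, with the eigenvalues on the diagonal: the new coordinates are uncorrelated.
#%%
Zcov = np.dot(Z.T, Z)/m
print(np.round(Zcov, 6)) # off-diagonal entries ~0, diagonal ~ the eigenvalues S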
#%% [markdown]
# #### Let's visualise the data before and after PCA
#%%
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(18,5)) #Create side-by-side plots
fig.subplots_adjust(wspace=0.2) # leave some space between figs
#Plot the original data
axs[0].scatter(data[:,0], data[:,1])
axs[0].set_title("Original data")
#Plot Uncorrelated data
axs[1].scatter(Z[:,0], Z[:,1])
axs[1].set_title("Uncorrelated Data")
#%% [markdown]
# #### Dimensionality reduction of the uncorrelated data
#%% [markdown]
# To reduce the dimensionality of our $2D$ data to $1D$, we remove the principal component that captures the least variation. Our principal components, which are the eigenvectors of the covariance matrix, are $U[:,0]$ and $U[:,1]$. By projecting our data Xnorm onto just $U[:,0]$, we get a reduced $Z$ in $1D$.
#
# In general, we keep the $k$ eigenvectors in $U$ that capture the maximum variation. If $M$ is the number of data points and $N$ the number of features, our reduced data $Z$ becomes:
#
# $\text{Xnorm}_{M\times N} \times \text{Ureduced}_{N\times k}=\text{Z}_{M\times k}$
#%%
# In the case where we want k=1 (i.e. reduce the data to one dimension)
k=1 # number of principal components to retain
U_red = U[:,0:k] # choose the first k principal components
#project our data Xnorm onto reduced U
Z_red = np.dot(Xnorm, U_red)
print(Z_red.shape)
print(U_red.shape)
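#%% [markdown]
# How much variance does the retained component capture? Since each eigenvalue measures the variance along its principal axis, the fraction retained is the sum of the kept eigenvalues divided by the sum of all eigenvalues (a small sketch added for illustration).
#%%
var_retained = S[0:k].sum()/S.sum()
print("Variance retained with k={}: {:.2%}".format(k, var_retained))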
#%% [markdown]
# #### Visualise data after dimensionality reduction
#%%
# Reconstruct an approximation of Xnorm from the reduced data Z_red
X_rec = np.dot(Z_red, U_red.T)
#Visualize the recovered data
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(18,5))
fig.subplots_adjust(wspace=0.2) # leave some space between figs
#Plot Xnorm data
axs[0].scatter(Xnorm[:,0], Xnorm[:,1])
axs[0].set_title("Normalised Original Data")
#Plot For "Recovered" Data
axs[1].scatter(X_rec[:,0], X_rec[:,1])
axs[1].set_title("Data after Dimensionality reduction")
#%% [markdown]
# #### Measuring ‘reconstruction error’
#%% [markdown]
# How much information did we lose after dimensionality reduction? To measure this, we calculate the reconstruction error of our data. The reconstruction error is the square root of the sum of squared errors over all data points; essentially, it is the distance between the original data and the reconstructed data.
#
# $\text{Reconstruction Error}= \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}(\text{Xnorm}_{ij}-\text{Xrec}_{ij})^{2}}$
#
# ###### Frobenius Norm
# To scale this error to lie between 0 and 1, we divide it by the Frobenius norm of the original data Xnorm. The Frobenius norm of a matrix is defined as the square root of the sum of the absolute squares of its elements.
#
# In Python, the Frobenius norm is implemented in numpy's linear algebra package and can be computed as np.linalg.norm(X, 'fro').
#%%
recon_error = np.linalg.norm(Xnorm-X_rec, 'fro')/np.linalg.norm(Xnorm, 'fro')
print("The reconstruction error is: {}".format(recon_error))
#%% [markdown]
# ### PCA Using Inbuilt Modules
#%% [markdown]
# ##### Import data
#%%
newData = pd.read_csv('/Users/ragyibrahim/Downloads/5DGaussian.csv', delimiter= ',', header=None).values
print(newData[0:5,:])
print(newData.shape)
#%% [markdown]
# #### Normalise data
#%%
from sklearn.preprocessing import scale
Xnorm_py = scale(newData)
#%% [markdown]
# #### Implement PCA
#%%
from sklearn.decomposition import PCA
newDataPca = PCA(n_components=5)
newDataPca.fit(Xnorm_py)
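#%% [markdown]
# A quick inspection of the fitted object (added as a sketch): **components_** holds the principal axes as rows (the analogue of the columns of U, possibly with flipped signs), and **explained_variance_** holds the variance along each axis (sklearn uses the unbiased n-1 denominator, so the values can differ slightly from the 1/m version computed earlier).
#%%
print(newDataPca.components_.shape)   # (5, 5): one principal axis per row
print(newDataPca.explained_variance_) # variance captured along each axis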
#%% [markdown]
# #### Dimensionality Reduction
#%% [markdown]
# ###### Dimensionality Reduction: How to choose the dimensions to keep
# In other words, how do we choose the number of principal components to retain? We can decide this by looking at the variance captured by each principal component.
#%%
# Find the fraction of variance captured by each principal axis
newDataVar = newDataPca.explained_variance_ratio_
print(newDataVar)
#%% [markdown]
# From the above array, we see that the first component captures around 84% of the variance, the second around 10%, and so on. To make this easier to interpret, we can calculate the cumulative variance:
#%%
newDataCumVar = np.cumsum(np.round(newDataVar, decimals=4)*100)
print(newDataCumVar)
plt.plot(newDataCumVar)
plt.xlabel("Number of Principle Axis")
plt.ylabel("Cumilative Variance Captured")
#%% [markdown]
# So, if k is the number of principal components, we see that k=1 captures around 84% of the variance, k=2 (the first two components together) captures around 94%, and so on. Since k=2 captures more than 90% of the variance in our data, let's drop the third, fourth and fifth components.
#%%
newDataPca2 = PCA(n_components=2)
newDataTrans = newDataPca2.fit_transform(Xnorm_py) # fits the model and applies the reduction in one step
print(newDataTrans.shape)
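#%% [markdown]
# Instead of reading k off the plot, we could also pick it programmatically as the smallest k whose cumulative explained variance crosses a chosen threshold (a sketch; the 90% threshold is an assumption for illustration).
#%%
threshold = 90 # percent of variance we want to keep (illustrative choice)
k_auto = int(np.argmax(newDataCumVar >= threshold)) + 1
print("Smallest k capturing at least {}% variance: {}".format(threshold, k_auto))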
#%% [markdown]
# #### Compute reconstruction error
#%% [markdown]
# We can recreate an approximation of our original data from the reduced data using the inverse_transform() function, and then calculate the reconstruction error as before.
#%%
newDataRcov = newDataPca2.inverse_transform(newDataTrans)
print(newDataRcov.shape)
recon_error2 = np.linalg.norm(Xnorm_py-newDataRcov, 'fro')/np.linalg.norm(Xnorm_py, 'fro')
print("The reconstruction error is: {}".format(recon_error2))
#%% [markdown]
# #### Influence of Dimensionality Reduction on Reconstruction error
#%% [markdown]
# Let us see how reducing the dimensionality of the data affects the reconstruction error. We perform PCA with an increasing number of principal components and measure the reconstruction error in each case.
#%%
nSamples, nDims = Xnorm_py.shape
# vary principal components from 1 to 5
n_comp = range(1,nDims+1)
print(n_comp)
#%%
# Initialize vector of rec_error (index 0 is unused, since k starts at 1)
rec_error = np.zeros(len(n_comp)+1)
for k in n_comp:
    pca = PCA(n_components=k)
    Zred = pca.fit_transform(Xnorm_py)
    Xrec = pca.inverse_transform(Zred)
    rec_error[k] = np.linalg.norm(Xnorm_py-Xrec, 'fro')/np.linalg.norm(Xnorm_py, 'fro')
    print("k={}, rec_error={}".format(k, rec_error[k]))
rec_error = rec_error[1:] # we started recording from index 1, so drop index 0
# Visualize the change in error
plt.plot(n_comp, rec_error)
plt.xlabel('No of principal components (k)')
plt.ylabel('Reconstruction Error')