---
title: "Introduction to Digital Soili Mapping"
subtitle: "Post-course Project: Predicting Soil Classes"
author: "Dave White"
date: "May 3, 2018"
output: "html_document"
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
knitr::opts_knit$set(root.dir = "C:/Essex/modeling")
```
### 1. Objective
##### My objective is to: (describe specific aim of project and the target soil classes) *You may choose to explore your data in Steps 2 and 3 below before deciding on a final objective*
My objective for this project is to use random forest to predictively model soil classes for the Essex County, VT data set. I will explore the data provided and create additional data as needed. I will then evaluate statistical data-reduction techniques by assessing model performance in an iterative approach, aiming to maximize model accuracy and minimize the OOB estimate of error. Once a final selection of covariates is made, I will use the selected covariates in a cost-constrained cLHS to generate a set of sampling points. Finally, I will make model predictions and analyze the outputs.
* Note: throughout this document, some lines of code are commented out with # to reduce the length of the final document; these lines mostly review and inspect data objects.
```{r setupenvironment}
# Setting up the r environment
# Load the required libraries
library(sp)
library(maps)
library(rgdal)
library(raster)
library(caret)
library(maptools)
library(doParallel)
library(googleVis)
library(psych)
library(clhs)
library(ggplot2)
library(gridExtra)
library(grid)
library(corrplot)
# Set working directory
setwd("C:/Essex/modeling")
```
### 2. Data Provided
| Point Data | Raster Data | Other Data |
|--------------|---------------|--------------|
| Essex_field_points:<br> -soil class<br>-depth to redox (cm)<br>-depth to densic (cm)<br>-location (x,y)<br><br>Pre-split training/test data:<br>-Training data pedons<br>-Test/Validation data pedons|Terrain Derivatives:<br>-Hillshade and Shaded Relief for visualization<br>-Elevation<br>-Slope(percent) 30 and 60m neighborhood<br>-Profile curvature<br>-Planform curvature<br>-Tangent curvature<br>-Curvature<br>-Min curvature<br>-Max Curvature<br>-Multipath wetness index smoothed<br>-Total solar insolation<br>-Multiresolution valley bottom flatness<br>-Terrain ruggedness index<br>-Depression cost surface<br>-Surface area factor<br><br>Spectral Derivatives:<br>-none|Soil OSDs for reference |
I decided to pull in the field points and view them on a Google map to familiarize myself with the project area.
```{r pointData, results='asis'}
# setting options for google map plot
options(gvis.plot.tag = 'chart')
# load pedon data and inspect locations
comp <- readShapePoints("comp.shp")
# inspecting shapefile
head(comp)
summary(comp)
names(comp)
# look up coordinate system - empty
comp@proj4string
# prj2epsg.org - used this site to find the CRS code for the projection listed in ArcGIS
# Assign projection
proj4string(comp) <- CRS("+init=epsg:32145")
# inspect projection
comp@proj4string
# convert projection to use with google maps
comp.2 <- spTransform(comp, CRS('+proj=longlat +datum=WGS84'))
# convert to data frame
comp.3 <- as.data.frame(comp.2)
# organizing lat and long for the google map
latlong <- paste(comp.3$coords.x2, comp.3$coords.x1, sep = ":")
class <- comp.3$FEATURE
comp.4 <- cbind.data.frame(class, latlong)
# create google map and store to object
sites <- gvisMap(comp.4, locationvar = "latlong", tipvar = "class",
options = list(mapType = 'terrain', useMapTypeControl = TRUE,
enableScrollWheel = TRUE,
width = "900px", height = "500px"))
# plot google map
plot(sites)
```
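A side note on the import: `readShapePoints()` from maptools ignores the shapefile's .prj file, which is why the CRS had to be assigned manually above. When a .prj is present, `rgdal::readOGR()` picks it up automatically; a minimal sketch (not evaluated):
```{r readOGRAlt, eval=FALSE}
# alternative import that reads the .prj automatically (sketch)
comp.alt <- readOGR(dsn = ".", layer = "comp")
comp.alt@proj4string # CRS populated from comp.prj; no manual assignment needed
```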
```{r gc, echo=FALSE, include=FALSE}
gc()
```
### 3. Covariates
##### Covariates - Identify the characteristics of the study area that can be represented by spectral, terrain, or other ancillary data. Use the SCORPAN framework to organize important study area characteristics and existing data layers that might be used to represent these characteristics. Depending on your objective, it may not be necessary to consider all SCORPAN factors.
| SCORPAN<br>Covariate | Soil-Landscape Characteristics | Existing Data Layers<br>that Represent<br> Soil-Landscape Characteristics|
|----------------------------|--------------------------------------|--------------------------------|
|Soil|Soils with redox features|multipath wetness index smoothed|
|Climate|Moist Environment|-Total Solar insolation<br>-Multipath Wetness Index|
|Organisms|-Substantial Vegetation coverage<br>-Soils high in organic matter|-none listed|
|Relief| -Glacial till<br>-Drumlins<br>-hills, mountains - uplands and lowlands|DEM Derivatives:<br>-Elevation<br>-Slope<br>-Curvatures(all)<br>-Multipath wetness index<br>-Multiresolution valley bottom flatness<br>-Terrain ruggedness index<br>-Depression cost raster<br>-Surface Area Factor|
|Parent Material| Glacial till| -none listed|
|Age|-Geologic maps<br>-Geomorphic surfaces| -none listed|
|Location| Inherent within the data sets||
### 4. Covariates - Deriving Raster Layers
##### Do you need to create any additional data layers to represent the biophysical characteristics identified in question 3? If so, list the data layer and the biophysical characteristic it represents.
| Soil-Landscape Characteristic | New Data Layer | Steps to Create New Data Layer |
|-------------------------------------|----------------------|----------------------------------|
|Vegetation Density| - NDVI|-Download Landsat8 surface reflectance scene<br>-Compute NDVI in ERDAS|
|Organics and redox features|-Stream Power Index<br>-SAGA wetness index|-Use DEM to create covariates in SAGA GIS with default settings|
|soil mineralogy, redder surfaces, lighter darker surfaces|-clay mineralogy<br>-Fe Oxide index<br>-Mineralogy Index(3band)| - Download Landsat8 surface reflectance scene <br>-Compute indices in ERDAS|
The final step in processing was to set all rasters to the same resolution, pixel alignment, and extent. This was accomplished with the Batch Project tool in ArcGIS, with snap raster, mask, extent, projection, and cell size set to match the 5m DEM provided. Batch Project was used because, during review of the covariate data, MRVBF was found to have projection issues that prevented it from being combined into a raster stack in R. A scripted alternative in R is sketched below.
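For reference, the same alignment can be scripted in R with the raster package. A rough sketch (not evaluated; the template file name `esx_dem5m.tif` is a placeholder for the provided 5m DEM):
```{r alignRasters, eval=FALSE}
# align every covariate to the 5m DEM grid (sketch; file name is a placeholder)
template <- raster("esx_dem5m.tif")
tifs <- list.files(pattern = "tif$")
aligned <- lapply(tifs, function(f) {
  # match CRS, extent, and cell size of the template; bilinear suits continuous layers
  projectRaster(raster(f), to = template, method = "bilinear")
})
aligned.stack <- stack(aligned)
```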
### 5. Covariate Selection
##### Based on the SCORPAN factors and data listed above, what covariate set did you choose for modeling? Did you use a particular covariate selection method? List covariates and describe selection method below. *You may choose to use your expert knowledge and understanding of soil-landscape relationships in lieu of a quantitative method because those weren't comprehensively covered in this intro course*
Random forest models are fairly robust, so initial modeling was done with all of the provided covariates plus the additional covariates created. Several statistical techniques were then used to filter the covariate set, with the aim of increasing model accuracy and decreasing the out-of-bag (OOB) error rate.
Statistical Covariate Reduction Methods:
Near Zero Variance:
This method computes the variance of each covariate and flags those whose variance is at or very near zero; such low-information covariates are then removed from the set.
Correlation Reduction:
This method generates a correlation matrix of the covariate set. A threshold is chosen for the degree of correlation allowed between covariates, and covariates whose correlation exceeds that threshold are removed.
Recursive Feature Elimination:
This method runs multiple random forest models, tracking variable importance at each iteration, and returns the covariates that are repeatedly ranked most important. A compact illustration of all three filters follows.
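As that compact illustration, the sketch below applies all three filters to a built-in data set (mtcars) using the same caret functions used on the real covariates later in this document (not evaluated):
```{r reductionToy, eval=FALSE}
# toy demonstration of the three covariate-reduction filters (sketch)
x <- mtcars[, -1]                                  # treat all but mpg as covariates
nzv <- nearZeroVar(x)                              # indices of near-zero-variance columns
if (length(nzv) > 0) x <- x[, -nzv]
highCorr <- findCorrelation(cor(x), cutoff = 0.95) # indices of highly correlated columns
if (length(highCorr) > 0) x <- x[, -highCorr]
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
fit <- rfe(x, mtcars$mpg, sizes = 1:ncol(x), rfeControl = ctrl)
predictors(fit)                                    # covariates RFE retains
```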
Model 1: This model serves as a baseline. All of the components in the Essex_field_points shapefile are modeled using all developed covariates.
```{r covariateReduction1}
# load Raster Data as a Raster Stack
# read in raster layers to create raster stack for random forest prediction
r.list=list.files(getwd(), pattern="tif$", full.names = FALSE)
r.list
#create raster stack
r.stack <- stack(r.list)
#remove files from working environment keeping only the raster stack and shapefile pedon locations (memory management)
to.remove <- ls()
matches <- c("r.stack", "comp$")
to.remove <- c(to.remove[!grepl(paste0(matches, collapse = "|"), to.remove)], "to.remove")
rm(list=to.remove)
#gc()
# set up dataframe for modeling
# first extract raster values from the raster stack by the points
# extract values from r.stack and set up a new dataframe
all.df <- extract(r.stack, comp, df=T, na.rm=TRUE)
# inspect dataframe
#head(all.df)
#names(all.df)
#summary(all.df)
#describe(all.df) # nice method for looking at stats including skewness from library(psych)
# add component information to the dataframe
all.df$class <- comp@data$FEATURE
# inspect dataframe
#head(all.df)
#names(all.df)
# prepare data for model training
# drop the ID column and move the class column to the front
# get number of colums from dataframe
#length(all.df)
# output from above
#> length(all.df)
#[1] 26
# dropping column ID. use the length from above for the first number, subtract 1 for the second
comp.rf <- cbind(all.df[ ,26], data.frame(all.df[ ,2:25]))
# inspect comp
#names(comp.rf)
# rename first column
names(comp.rf)[c(1)] <- c('class')
# inspect comp, class should be the first column, followed by the 24 covariates
#names(comp.rf)
#summary(comp.rf)
# inspect naming convention for components
#unique(comp.rf$class)
# need to convert the component names so that there are no spaces
comp.rf$class <- make.names(comp.rf$class)
#unique(comp.rf$class)
# check the data type of the class field
#data.class(comp.rf$class)
# class is a character vector; it needs to be a factor
# converting class to a factor
comp.rf$class <- factor(comp.rf$class)
#data.class(comp.rf$class)
# Modeling the soil components by class using the Caret Package
# The following code sets the seeds so that RF from the caret package can run in parallel with reproducible results
#The simulation will fit models with subset sizes of 50, 49...1
subsets <- c(1:50)
# set seeds to get reproducible results when running the process in parallel
set.seed(12)
seeds <- vector(mode = "list", length=76)
for(i in 1:75) seeds[[i]] <- sample.int(1000, length(subsets) + 1)
seeds[[76]] <- sample.int(1000, 1)
# In the train control you can set the number of folds and repeats, as well as the data splitting; here p = 0.7, so 70 percent of the data is used for training and 30 percent for testing.
# set up the train control
# this is the only instance of fitControl, since it is used over and over keep in working environment
fitControl <- trainControl(method = "repeatedcv",
number = 15,
repeats = 5,
p = 0.7, #30% used for test set, 70% used for training set
selectionFunction = 'best',
classProbs = T,
savePredictions = T,
returnResamp = 'final',
search = "random",
seeds = seeds)
# Random Forest - Parallel process, highlight and run from c1 to stopCluster(c1)
c1 <- makeCluster(detectCores()-1, type='PSOCK')# utilizes all but 1 core
registerDoParallel(c1)
set.seed(48)
rfFit = train(class ~ ., data = comp.rf,
"rf",
trControl = fitControl,
ntree = 500, #number of trees default is 500, which seems to work best anyway.
tuneLength=10,
metric = 'Kappa',
na.action=na.pass,
keep.forest=TRUE, # added this line and the next for partial dependence plots
importance=TRUE)
stopCluster(c1)
#gc()
# inspect rfFit
rfFit
# accuracy 0.5606323
# Get print out of final model
rfFit$finalModel
#OOB estimate of error rate: 44.86%
```
```{r gc1, echo=FALSE, include=FALSE}
gc()
```
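Before adjusting the training data, the per-class breakdown of the baseline fit is worth a look; the `confusion` element of the underlying randomForest object splits the OOB error by class (a quick sketch, not evaluated):
```{r baselineConfusion, eval=FALSE}
# OOB confusion matrix of the baseline model:
# rows = observed class, columns = predicted class, last column = per-class error
rfFit$finalModel$confusion
```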
Exploration of the data revealed water features within the modeling area. Random forest will try to fit the soil components to the water features. To reduce class error and increase model performance, training points with the class of water were added to the training data set. This was accomplished by editing the Essex_field_points shapefile in ArcGIS, using Landsat8 imagery to visually locate the water features and add training points in those locations. This method works best for components, typically miscellaneous areas, that are easily mapped from imagery.
Model 2: This model is the same as the previous one, but with water features included.
```{r covariateReduction2}
# load component data including water component
comp2 <- readShapePoints("comp2.shp")
# inspecting shapefile
#head(comp2)
#summary(comp2)
#names(comp2)
# look up coordinate system
#comp2@proj4string
# prj2epsg.org
# Assign projection
proj4string(comp2) <- CRS("+init=epsg:32145")
# inspect projection
#comp2@proj4string
# set up data frame for modeling
# first extract raster values from the raster stack by the points
# extract values from r.stack and set up a new dataframe
all.df2 <- extract(r.stack, comp2, df=T, na.rm=TRUE)
# inspect dataframe
#head(all.df2)
#names(all.df2)
#summary(all.df2)
#describe(all.df2) # nice method for looking at stats including skewness
# Add component information to the dataframe
all.df2$class <- comp2@data$FEATURE
# inspect dataframe
#head(all.df2)
#names(all.df2)
# prepare data for model training
# drop the ID column and move the class column to the front
# get number of colums from dataframe
#length(all.df2)
# output from above
#> length(all.df2)
#[1] 26
# dropping column ID. use the length from above for the first number, subtract 1 for the second
comp.rf2 <- cbind(all.df2[ ,26], data.frame(all.df2[ ,2:25]))
# inspect comp
#names(comp.rf2)
# rename first column
names(comp.rf2)[c(1)] <- c('class')
# inspect comp, class should be the first column, followed by the 24 covariates
#names(comp.rf2)
#summary(comp.rf2)
# inspect naming convention for components
#unique(comp.rf2$class)
# need to convert the component names so that there are no spaces.
comp.rf2$class <- make.names(comp.rf2$class)
#unique(comp.rf2$class)
# check the data type of the class field
#data.class(comp.rf2$class)
# class is a character vector; it needs to be a factor
# converting class to a factor
comp.rf2$class <- factor(comp.rf2$class)
#data.class(comp.rf2$class)
# Modeling the soil components by class using the Caret Package
# Random Forest - Parallel process, highlight and run from c1 to stopCluster(c1)
c1 <- makeCluster(detectCores()-1, type='PSOCK')# utilizes all but 1 core
registerDoParallel(c1)
set.seed(48)
rfFit2 = train(class ~ ., data = comp.rf2,
"rf",
trControl = fitControl,
ntree = 500, #number of trees default is 500, which seems to work best anyway.
tuneLength=10,
metric = 'Kappa',
na.action=na.pass,
keep.forest=TRUE, # added this line and the next for partial dependence plots
importance=TRUE)
stopCluster(c1)
#gc()
# Inspect rfFit
rfFit2
# accuracy 0.5827957
# Get print out of final model
rfFit2$finalModel
#OOB estimate of error rate: 44.33%
```
```{r gc2, echo=FALSE, include=FALSE}
gc()
```
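Because both fits used the same seeds and the same fitControl, caret's `resamples()` can place their cross-validated metrics side by side. A sketch (not evaluated, and only indicative, since the two models were trained on different point sets):
```{r compareFits, eval=FALSE}
# side-by-side resampling summary for models 1 and 2 (sketch)
cmp <- resamples(list(baseline = rfFit, with.water = rfFit2))
summary(cmp) # Accuracy and Kappa distributions across the 75 resamples
```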
Model performance increased slightly with the addition of the water components. The next step is to optimize model performance through covariate reduction.
Model 3: This model is the same as the previous one, with the addition of covariate reduction by statistical means. The two methods applied here are near-zero-variance filtering and correlation reduction.
```{r covariateReduction3}
# convert raster stack to a data frame
stack.df <- as.data.frame(r.stack, na.rm=TRUE)
length(stack.df) # the number of covariates we have in the raster stack
# 1: Near Zero Variance Filtering
# find covariates with near zero variance
#
# removes covariates whose variance is near or equal to zero
# this process is parallelized to enable faster processing
cl <- makeCluster(detectCores()-1) # makes a cluster using all but 1 cpu cores
registerDoParallel(cl)
# the actual nearZeroVar call; it creates a vector indicating which covariates should be removed
zeroVar <- nearZeroVar(stack.df, foreach = TRUE, allowParallel = TRUE)
stopCluster(cl)
# removing unused items from memory
#gc()
# check the zeroVar object
head(zeroVar)
# *note: if the result is integer(0), there are no covariates to remove
# remove covariates with zeroVar
stack.df <- if(length(zeroVar) > 0){
stack.df[, -zeroVar]
} else {
stack.df
}
length(stack.df) # the new number of covariates; nearZeroVar removed 1 covariate
names(stack.df) # curvature was removed
# 2. Correlation Reduction
# find correlations within covariates
# a correlation matrix is computed, and covariates with a high degree of correlation are returned
# create correlation matrix
cor.mat <- cor(stack.df)
# visually examine the correlation matrix
#print(cor.mat)
corrplot(cor.mat)
# many of the curvature covariates and slope derivatives are highly correlated
# I don't want to remove too many variables, so I will set the threshold for highly correlated covariates high
# find a high degree of correlation; cutoff is the threshold: if cutoff = 0.8, covariates with correlation >= 0.8 are removed
highCorr <- findCorrelation(cor.mat, cutoff = 0.95)
# remove covariates with high degree of correlation
stack.df <- if(length(highCorr) > 0){
stack.df[, -highCorr]
} else {
stack.df
}
length(stack.df) # number of covariates left, corr reduction removed 3 covariates
names(stack.df)
# Model with remaining covariates
# Try running the model again with the 20 covariates remaining
# set up data frame for modeling
# create a subset of the original raster stack
r.stack.2 <- subset(r.stack, names(stack.df))
names(r.stack.2)
# extract values from r.stack.2 and set up a new dataframe
all.df3 <- extract(r.stack.2, comp2, df=T, na.rm=TRUE)
# inspect dataframe
#head(all.df3)
#names(all.df3)
#describe(all.df3)
#summary(all.df3)
# add component information to the dataframe
all.df3$class <- comp2@data$FEATURE
# inspect dataframe
#head(all.df3)
#names(all.df3)
# prepare data for model training
# drop the ID column and move the class column to the front
# get number of colums from dataframe
#length(all.df3)
# output from above
#> length(all.df3)
#[1] 22
# dropping column ID. use the length from above for the first number, subtract 1 for the second
comp.rf3 <- cbind(all.df3[ ,22], data.frame(all.df3[ ,2:21]))
# inspect comp
#names(comp.rf3)
# rename the first column to class
names(comp.rf3)[c(1)] <- c('class')
# inspect comp, class should be the first column, followed by the 20 covariates
#names(comp.rf3)
#describe(comp.rf3)
#summary(comp.rf3)
# need to convert the component names so that there are no spaces.
comp.rf3$class <- make.names(comp.rf3$class)
#data.class(comp.rf3$class)
# class is a character vector; it needs to be a factor
# converting class to a factor
comp.rf3$class <- factor(comp.rf3$class)
#data.class(comp.rf3$class)
# Modeling the soil components by class using the Caret Package
# Random Forest - Parallel process, highlight and run from c1 to stopCluster(c1)
c1 <- makeCluster(detectCores()-1, type='PSOCK')
registerDoParallel(c1)
set.seed(48)
rfFit3 = train(class ~ ., data = comp.rf3,
"rf",
trControl = fitControl,
ntree = 500, #number of trees default is 500, which seems to work best anyway.
tuneLength=10,
metric = 'Kappa',
na.action=na.pass,
keep.forest=TRUE, # added this line and the next for partial dependence plots
importance=TRUE)
stopCluster(c1)
#gc()
# inspect rfFit
rfFit3
# accuracy 0.5918594
# Get print out of final model
rfFit3$finalModel
#OOB estimate of error rate: 42.27%
```
```{r gc3, echo=FALSE, include=FALSE}
gc()
```
In model 3, the accuracy increased and the OOB error rate decreased. There are several other methods for filtering down covariates; next, Recursive Feature Elimination will be used to further reduce the covariate set.
Model 4: This model is the same as above, but with Recursive Feature Elimination (RFE) implemented.
```{r rfe}
# If the data include levels that have only 1 case, rfe does not work...
# It gives: Error in { : task 3 failed - "Can't have empty classes in y."
# So, for RFE we will only use those classes that have more than one case per class
# select levels that have more than 1 case to proceed with the rfe
subset(table(comp.rf3$class), table(comp.rf3$class) > 1)
names(subset(table(comp.rf3$class), table(comp.rf3$class) > 1))
# Use this to get names of factors that have more than one case
cat(paste(shQuote(names(subset(table(comp.rf3$class), table(comp.rf3$class) > 1)), type="cmd"), collapse=", "))
# manually paste the result here and add the 'c(' and the ')' to bookend the list
# manually paste the new list here for subsetting the dataframe
comp.sub <- subset(comp.rf3, class %in% c("Brayton", "Cabot", "Colonel", "Dixfield", "Skerry", "water", "Wilmington"))
nrow(comp.sub)
# CRITICAL STEP! Re-factor the subset so that empty levels are dropped
comp.sub$class <- factor(comp.sub$class)
#data.class(comp.sub$class)
```
Run Recursive Feature Elimination, which selects the covariates that build the best RF model.
```{r rfe2}
# This process is set up to run in parallel. Set the number of CPU cores
# in the makeCluster() function; this example uses detectCores()-1 to leave one core free.
# To set the number of cores manually, type the number in place of detectCores()-1.
# Next, set the subset sizes: the simulation will fit models with subset sizes of 50, 49, ..., 1
# (capped at the number of available covariates).
subsets <- c(1:50)
# set seeds to get reproducible results when running the process in parallel
set.seed(12)
seeds <- vector(mode = "list", length=76)
for(i in 1:75) seeds[[i]] <- sample.int(1000, length(subsets) + 1)
seeds[[76]] <- sample.int(1000, 1)
# set up the rfe control
ctrl.RFE <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
number = 15,
repeats = 5,
seeds = seeds,
verbose = FALSE)
## highlight and run everything from c1 to stopCluster(c1) to run RFE
#gc()
c1 <- makeCluster(detectCores()-1, type='PSOCK')
registerDoParallel(c1)
set.seed(9)
rf.RFE <- rfe(x = comp.sub[,-1],
y = comp.sub$class,
sizes = subsets,
rfeControl = ctrl.RFE,
allowParallel = TRUE
)
stopCluster(c1)
#gc()
# Look at the results
rf.RFE
plot(rf.RFE, metric="Accuracy", main='RFE Accuracy')
# See list of predictors
predictors(rf.RFE)
# Create a Random Forest model utilizing the covariates selected using RFE
# create a new raster stack for model development from the predictors above
r.stack.rfe <- subset(r.stack.2, predictors(rf.RFE))
names(r.stack.rfe)
# extract data from the raster stack
# set up data frame for modeling
# first extract raster values from the raster stack by the points
# extract values from r.stack.rfe and set up a new dataframe
all.df4 <- extract(r.stack.rfe, comp2, df=T, na.rm=TRUE)
# inspect dataframe
#head(all.df4)
#names(all.df4)
#summary(all.df4)
#describe(all.df4) # nice method for looking at stats including skewness
# add component information to the dataframe
all.df4$class <- comp2@data$FEATURE
# inspect dataframe
#head(all.df4)
#names(all.df4)
# prepare data for feature selection and model training
# drop the ID column and move the class column to the front
# get number of colums from dataframe
#length(all.df4)
# output from above
#> length(all.df4)
#[1] 21
# dropping column ID. use the length from above for the first number, subtract 1 for the second
comp.rf4 <- cbind(all.df4[ ,21], data.frame(all.df4[ ,2:20]))
# inspect comp
#names(comp.rf4)
# rename first column
names(comp.rf4)[c(1)] <- c('class')
# inspect comp, class should be the first column, followed by the 19 covariates
#names(comp.rf4)
# inspect naming convention for components
#unique(comp.rf4$class)
# need to convert the component names so that there are no spaces.
comp.rf4$class <- make.names(comp.rf4$class)
#unique(comp.rf4$class)
# check the data type of the class field
#data.class(comp.rf4$class)
# class is a character vector; it needs to be a factor
# converting class to a factor
comp.rf4$class <- factor(comp.rf4$class)
#data.class(comp.rf4$class)
# Modeling the soil components by class using the Caret Package
# The following code sets the seeds so that RF from the caret package can run in parallel with reproducible results
#The simulation will fit models with subset sizes of 50, 49.....1.
subsets <- c(1:50)
# set seeds to get reproducible results when running the process in parallel
set.seed(12)
seeds <- vector(mode = "list", length=76)
for(i in 1:75) seeds[[i]] <- sample.int(1000, length(subsets) + 1)
seeds[[76]] <- sample.int(1000, 1)
# In the train control you can set the number of folds and repeats, as well as the data splitting; here p = 0.7, so 70 percent of the data is used for training and 30 percent for testing.
# set up the train control
fitControl <- trainControl(method = "repeatedcv",
number = 15,
repeats = 5,
p = 0.7, #30% used for test set, 70% used for training set
selectionFunction = 'best',
classProbs = T,
savePredictions = T,
returnResamp = 'final',
search = "random",
seeds = seeds)
#Random Forest - Parallel process, highlight and run from c1 to stopCluster(c1)
c1 <- makeCluster(detectCores()-1, type='PSOCK')# utilizes all but 1 core
registerDoParallel(c1)
set.seed(48)
rfFit4 = train(class ~ ., data = comp.rf4,
"rf",
trControl = fitControl,
ntree = 500, #number of trees default is 500, which seems to work best anyway.
tuneLength=10,
metric = 'Kappa',
na.action=na.pass,
keep.forest=TRUE, # added this line and the next for partial dependence plots
importance=TRUE)
stopCluster(c1)
#gc()
# inspect rfFit
rfFit4
# accuracy 0.6006291
# get print out of final model
rfFit4$finalModel
#OOB estimate of error rate: 43.3%
```
```{r gc4, echo=FALSE, include=FALSE}
gc()
```
Covariate reduction methods such as near-zero-variance filtering, correlation reduction, and recursive feature elimination provide useful means of narrowing a list of covariates to those that contribute the most information without being redundant. A combination of these methods helped reduce class confusion and increase overall model accuracy. However, this project could use more sample locations to improve model accuracy: there may be components not yet described, and some components with only one location may need additional sampling or correlation to another component.
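A quick tally of observations per class in the current training frame makes the thin classes explicit (a sketch, not evaluated):
```{r classTally, eval=FALSE}
# observations per class, ascending; singletons are candidates for
# additional sampling or correlation to another component
sort(table(comp.rf4$class))
```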
### 6. Sampling
##### Use the selected set of covariates from Step (5) above and run conditioned Latin hypercube sampling (cLHS) in the TEUI toolkit as in Exercise 3.4. Display output cLHS points in a map and submit with project material.
At this point in the project, it appears that additional sampling points are necessary to boost model performance. cLHS will be used to generate them. Additionally, cost-constrained cLHS will be used to target sampling toward areas where the model performs poorly. This will be accomplished by generating a probability raster from Model 4 above and using it as the cost raster for cLHS; because cLHS penalizes high-cost cells, samples are steered away from areas where the model is already confident.
```{r clhs}
# utilize the best RF model to generate a probability raster
ess.prob <- predict(r.stack.rfe, rfFit4, type = "prob")
#, progress = "text")
# plot probability raster
par(mar = c(1,1,1,1))
plot(ess.prob, axes = FALSE,
main = "Init Samples on Probability / Cost Raster")
points(comp2, col = 'red', pch = 19)
# we will use the ess.prob layer as our cost raster; it needs to be added to the raster stack
# change name of prob raster and add to stack
r.stack.rfe$cost <- ess.prob
names(r.stack.rfe)
# convert the raster stack to a SpatialPointsDataFrame
# during this process every pixel becomes a point in a dataframe. WARNING: if you have a very large dataset this might not be appropriate, as it could take a long time.
s <- rasterToPoints(r.stack.rfe, spatial = TRUE)
set.seed(10) # for reproducibility
s.clhs <- clhs(s, size = 185, progress = FALSE, iter = 1500, cost = 'cost', simple = FALSE)
# diagnostic plots
plot(s.clhs, mode = c("obj"))#, "box", "dens")) # we can plot the box plots and density plots too
# extract indices
subset.idx <- s.clhs$index_samples
# plot points to inspect and save to shp file
# check visually on the sagawi raster
par(mar = c(1,1,1,1))
plot(r.stack.rfe$sagawi, axes = FALSE, pch = 19,
main = "cLHS points plotted on SAGA Wetness Index")
points(s[subset.idx, ], col = 'blue')
# check visually on the cost raster
par(mar = c(1,1,1,1))
plot(r.stack.rfe$cost, axes=FALSE,
main = "cLHS vs Initial Sample Points on Cost Raster")
points(s[subset.idx, ], col = 'blue', pch = 19 )
points(comp2, col = 'red', pch = 19 )
legend("topleft", inset = c(0.25, 0.01), legend = c("cLHS", "Initial Sample"), col = c("blue", "red"), pch = 19, cex = 0.8, title = "sample type", text.font = 4)
# save cLHS points to shp
writeOGR(s[subset.idx, ], dsn = 'C:/Essex', layer = 'clhs_points_cost', driver = 'ESRI Shapefile', overwrite_layer = TRUE)
```
```{r gc5, echo=FALSE, include=FALSE}
gc()
```
### 7. Prepare Training data for modeling
Note - I accomplished this task above in section 5.
### 8. Sampling/Data Exploration
##### Use the training point data output from Step (7) and plot it against a covariate distribution using this R script. Do the same for the cLHS point output from Step (7). Submit density and box plots with your project results. Which points seem to represent the distribution of the covariate better, cLHS or provided points? Discuss why.
```{r dataExploration}
# data exploration: compare your cLHS point output and your training point data to a covariate layer using density and box plots
# setup
#checking to see what the most important variables were for RF
rfIMP <- varImp(rfFit4, scale = FALSE)
plot(rfIMP)
# setting up data
raster_data <- as.data.frame(r.stack.rfe)
#names(raster_data)
raster_data <- raster_data[, -20] # drop the cost layer (column 20)
raster_data$pt.type <- rep("Raster", nrow(raster_data))
#names(raster_data)
clHS_points <- as.data.frame(s[subset.idx, ])
#names(clHS_points)
clHS_points <- clHS_points[, -c(20:22)] # drop the cost layer and coordinate columns
#names(clHS_points)
clHS_points$pt.type <- rep("cLHS", nrow(clHS_points))
init_points <- extract(r.stack.rfe, comp, df = TRUE, na.rm = TRUE)
#names(init_points)
init_points <- init_points[, -c(1,21)] # drop the ID and cost columns
#names(init_points)
init_points$pt.type <- rep("Init", nrow(init_points))
# set up new data frame for density and box plots
plt.dat <- rbind(raster_data, clHS_points, init_points)
# save data in workspace
save.image(file = "data.RData")
# remove files from the working environment, keeping only plt.dat (memory handling)
to.remove <- ls()
matches <- c("plt.dat")
to.remove <- c(to.remove[!grepl(paste0(matches, collapse = "|"), to.remove)], "to.remove")
rm(list=to.remove)
#gc()
```
```{r gc6, echo=FALSE, include=FALSE}
gc()
```
```{r plots}
# create density plots
fill <- plt.dat$pt.type
fill.f <- as.factor(plt.dat$pt.type)
line <- "#1F3552"
# MRVBF Density Plot
mrvbf.den <- ggplot(data = plt.dat, aes(x = esx_mrvbf))+
stat_density(aes(ymax = ..density.., ymin = -..density..,
fill = pt.type, color = pt.type),
geom = "ribbon", alpha = 0.6, position = "identity")+
facet_grid(. ~pt.type)+
scale_x_continuous(name = "Covariate : MRVBF",
breaks = seq(0, 6, 2),
limits = c(0,6))+
scale_y_continuous(name = "Density",
breaks = seq(-1,1,1),
limits = c(-1,1)) +
ggtitle("Volcano Density Plot Comprisons")+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face = "bold"), legend.position = "none")+
coord_flip()
# Wetness Index Density Plot
wet.den <- ggplot(data = plt.dat, aes(x = esx_mpwetsm))+
stat_density(aes(ymax = ..density.., ymin = -..density..,
fill = pt.type, color = pt.type),
geom = "ribbon", alpha = 0.6, position = "identity")+
facet_grid(. ~pt.type)+
scale_x_continuous(name = "Covariate : MPWETSM",
breaks = seq(0, 10, 2),
limits = c(0,10))+
scale_y_continuous(name = "Density",
breaks = seq(-.5,.5,.5),
limits = c(-.5,.5)) +
ggtitle("")+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face = "bold"), legend.position = "none")+
coord_flip()
# Slope60 Density Plot
slp60.den <- ggplot(data = plt.dat, aes(x = esx_slope60))+
stat_density(aes(ymax = ..density.., ymin = -..density..,
fill = pt.type, color = pt.type),
geom = "ribbon", alpha = 0.6, position = "identity")+
facet_grid(. ~pt.type)+
scale_x_continuous(name = "Covariate : Slope60",
breaks = seq(-5, 50, 10),
limits = c(-5, 50))+
scale_y_continuous(name = "Density",
breaks = seq(-2,2,.2),
limits = c(-.3,.3)) +
ggtitle("")+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face = "bold"), legend.position = "none")+
coord_flip()
# create box plots
# MRVBF box plot
mrvbf.box <- ggplot(plt.dat, aes(x = pt.type, y = esx_mrvbf, fill = pt.type))+
geom_boxplot(alpha = 0.6)+
ggtitle("Box Plot Comparisons")+
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")+
xlab("")+
ylab("")
# wetness index box plot
wet.box <- ggplot(plt.dat, aes(x = pt.type, y = esx_mpwetsm, fill = pt.type))+
geom_boxplot(alpha = 0.6)+
ggtitle("")+
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")+
xlab("")+
ylab("")
# slope 60 box plot
slp60.box <- ggplot(plt.dat, aes(x = pt.type, y = esx_slope60, fill = pt.type))+
geom_boxplot(alpha = 0.6)+
ggtitle("")+
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")+
xlab("")+
ylab("")
# combine all of the plots into one figure
grid.arrange(mrvbf.den, mrvbf.box,
wet.den, wet.box,
slp60.den, slp60.box,
nrow = 3,
top = "Visualizing Distributions")
```
```{r gc7, echo=FALSE, include=FALSE}
# Clear Workspace and bring back in other data
rm(list = ls())