In this report, I compare four models on how successfully they classify correctly and incorrectly executed exercises. The data is drawn from a previous research study, the details of which are here. In short, six participants performed an exercise in five different but predefined manners, classified from A (perfect execution) through B, C, D, and E, with each non-A class being “imperfect” in a different, discrete way.
The null hypothesis is that the recorded exercise-execution data is NOT sufficiently descriptive for the sensor recordings to inform a discriminant model that can classify the execution “classe” of each exercise observation after the fact.
library(caret)
library(gbm)
library(AppliedPredictiveModeling)
library(rpart)
library(doParallel)
#Find out how many cores are available (if you don't already know)
cores <- detectCores()
#Create cluster with desired number of cores; leave one core free for the
#machine's own processes
cl <- makeCluster(cores - 1)
#Register cluster
registerDoParallel(cl)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
set.seed(seed)
metric <- "Accuracy"
training <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
projectTesting <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
# Split the training data 80/20 into a model-building set and a validation
# holdout (model tuning itself uses the repeated 10-fold CV control above)
splitTrain <- createDataPartition(y=training$classe, p=0.8, list=FALSE)
myTraining <- training[splitTrain, ]
myTesting <- training[-splitTrain, ]
# Keep only columns that are less than 60% NA; derive the column set from
# myTraining and apply the same selection to myTesting so the sets stay aligned
keepCols <- colSums(is.na(myTraining)) < nrow(myTraining) * 0.6
myTraining <- myTraining[, keepCols]
myTesting <- myTesting[, keepCols]
projectTesting <- projectTesting[, colSums(is.na(projectTesting)) < nrow(projectTesting) * 0.6]
# Discover which columns have near-zero-variance
myDataNZV <- nearZeroVar(myTraining, saveMetrics=TRUE)
myDataNZV <- subset(myDataNZV,nzv==TRUE)
removeColNames <- rownames(myDataNZV)
# Remove the near-zero-variance columns from the training and testing datasets
myTraining <- myTraining[, -which(names(myTraining) %in% removeColNames)]
# Keep columns 7:59 - the sensor predictors plus classe - dropping the leading
# row-id, user-name, and timestamp bookkeeping columns
myTraining <- myTraining[, 7:59]
myTesting <- myTesting[, -which(names(myTesting) %in% removeColNames)]
myTesting <- myTesting[, 7:59]
# projectTesting still contains its near-zero-variance columns, so the same
# window of predictor columns sits one position later
projectTesting <- projectTesting[, 8:60]
We try four model types: trees (rpart), boosting (gbm), bagging (treebag), and random forest (rf). Each model's accuracy on the held-out validation set:
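The `train()` calls themselves do not appear in the report; a minimal sketch of how the four fits might look with caret, reusing the `control` and `metric` objects defined above (the fit object names are my own placeholders, and any option beyond what the output confirms is an assumption):

```r
# Fit the four candidate models with identical repeated-CV settings
# (object names fitRpart/fitGbm/fitBag/fitRf are placeholders)
set.seed(seed)
fitRpart <- train(classe ~ ., data = myTraining, method = "rpart",
                  metric = metric, trControl = control)
fitGbm   <- train(classe ~ ., data = myTraining, method = "gbm",
                  metric = metric, trControl = control, verbose = FALSE)
fitBag   <- train(classe ~ ., data = myTraining, method = "treebag",
                  metric = metric, trControl = control)
fitRf    <- train(classe ~ ., data = myTraining, method = "rf",
                  metric = metric, trControl = control, importance = TRUE)
```

With `doParallel` registered, caret distributes the resampling folds across the worker processes automatically.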
Model | Accuracy |
---|---|
Boosting (gbm) | 0.9678817 |
Random Forest (rf) | 0.994647 |
Bagging (treebag) | 0.9887841 |
Trees (rpart) | 0.4970686 |
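These accuracies would typically be computed on the 20% validation holdout; a sketch using caret's `confusionMatrix` (the fit object names are placeholders, not shown in the report):

```r
# Accuracy of each fitted model on the held-out validation set
fits <- list(gbm = fitGbm, rf = fitRf, bagging = fitBag, rpart = fitRpart)
sapply(fits, function(fit) {
  pred <- predict(fit, newdata = myTesting)
  confusionMatrix(pred, myTesting$classe)$overall["Accuracy"]
})
```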
Boosting | Random Forest | Bagging | Trees |
---|---|---|---|
0.7288362 | 0.7265431 | 0.7273075 | 0.749156 |
##
## Call:
## summary.resamples(object = results)
##
## Models: rpart, gbm, bagging, rf
## Number of resamples: 28
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.4726463 0.4948271 0.5038241 0.5093670 0.5121796 0.5862508 0
## gbm 0.9554140 0.9598918 0.9614652 0.9618720 0.9636827 0.9700637 0
## bagging 0.9789675 0.9858257 0.9888464 0.9879427 0.9904519 0.9936306 0
## rf 0.9904398 0.9929936 0.9942657 0.9939715 0.9950637 0.9968153 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.3099075 0.3390570 0.3523341 0.3616006 0.3629675 0.4801959 0
## gbm 0.9435532 0.9492439 0.9512649 0.9517601 0.9540435 0.9621205 0
## bagging 0.9733883 0.9820684 0.9858907 0.9847473 0.9879224 0.9919434 0
## rf 0.9879058 0.9911373 0.9927467 0.9923743 0.9937552 0.9959719 0
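The summary above comes from caret's `resamples()`; a sketch of how the comparison would be produced (again with assumed fit object names):

```r
# Pool the cross-validation resamples of all four models and summarise
results <- resamples(list(rpart = fitRpart, gbm = fitGbm,
                          bagging = fitBag, rf = fitRf))
summary(results)
# Lattice dotplot of accuracy and kappa across the four models
dotplot(results)
```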
From the results above, we choose the random forest as the best predictor model. Its accuracy of 99.46% on the validation holdout corresponds to an estimated error rate of roughly 0.54%. We reject the null hypothesis that the data cannot be successfully re-classified: we have strong confidence that the recorded measurements can indeed be used to classify successful or unsuccessful execution of the exercise.
Finally, we use the model to predict the “classe” values for the originally provided test data, which we reserved for final validation. The random forest's out-of-bag (OOB) estimate of the error rate is 0.51%, i.e. an estimated accuracy of 99.49%.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.51%
## Confusion matrix:
## A B C D E class.error
## A 4460 2 1 0 1 0.0008960573
## B 17 3013 8 0 0 0.0082290981
## C 0 8 2718 12 0 0.0073046019
## D 0 1 17 2553 2 0.0077730276
## E 0 1 4 6 2875 0.0038115038
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
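The prediction output above could be produced as follows, after which the parallel cluster registered at the top of the report should be released (again assuming the `fitRf` object name):

```r
# Predict classe for the 20 reserved test cases
predict(fitRf, newdata = projectTesting)
# Shut down the parallel workers started earlier
stopCluster(cl)
```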