## Synopsis

In this report, I compare four models on how accurately they classify correctly and incorrectly executed exercises. The data come from a previous research study, the details of which are here. In short, six participants performed an exercise in five different, predefined manners, labeled A (performed perfectly) through E, with each of the non-A classes (B, C, D, and E) being “imperfect” in a distinct way.

The null hypothesis is that the recorded sensor data are NOT descriptive enough for a discriminant model to classify the execution class (“classe”) of each exercise observation after the fact.

## Initializing the required libraries and control parameters

    library(caret)
    library(gbm)
    library(AppliedPredictiveModeling)
    library(rpart)
    library(doParallel)
    # Find out how many cores are available (if you don't already know)
    cores <- detectCores()
    # Create cluster with desired number of cores, leaving one open for the
    # machine's core processes
    cl <- makeCluster(cores[1] - 1)
    # Register cluster
    registerDoParallel(cl)
    # 10-fold cross-validation, repeated 3 times
    control <- trainControl(method="repeatedcv", number=10, repeats=3)
    seed <- 7
    set.seed(seed)
    metric <- "Accuracy"
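
The `repeatedcv` setting above asks for 10-fold cross-validation repeated 3 times. As a rough illustration of what that means, the base-R sketch below builds the hold-out row indices for such a scheme. The function `repeated_kfold` and the variable `holdouts` are hypothetical names introduced here, and the sketch assumes simple random folds; it does not claim to reproduce how caret builds its folds internally.

```r
# Base-R sketch of repeated k-fold cross-validation indices.
# Assumption: plain random folds, not stratified by the outcome.
# repeated_kfold() is a hypothetical helper, not part of caret.
repeated_kfold <- function(n, k = 10, repeats = 3, seed = 7) {
  set.seed(seed)
  resamples <- list()
  for (r in seq_len(repeats)) {
    # Shuffle the fold labels so each row gets exactly one fold per repeat
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      name <- sprintf("Fold%02d.Rep%d", f, r)
      # Rows held out for assessment in this resample
      resamples[[name]] <- which(fold_id == f)
    }
  }
  resamples
}

holdouts <- repeated_kfold(n = 100)
length(holdouts)          # number of resamples: k * repeats
range(lengths(holdouts))  # hold-out size per resample
```

Within each repeat, every row is held out exactly once; repeating the partition with a fresh shuffle gives several accuracy estimates per model, which is what makes a summary of minimum, median, and maximum accuracy possible.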

    training <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
    # ASSUMPTION: projectTesting (the originally provided test data) is used
    # below but never loaded in this report; the file name here is a guess.
    projectTesting <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
    # CROSS VALIDATION
    # Hold out 20% of the training data for testing
    splitTrain <- createDataPartition(y=training$classe, p=0.8, list=FALSE)
    myTraining <- training[splitTrain, ]
    myTesting <- training[-splitTrain, ]

## Data Cleaning

    # Remove columns that are 60% or more NA
    myTraining <- myTraining[, colSums(is.na(myTraining)) < nrow(myTraining) * 0.6]
    myTesting <- myTesting[, colSums(is.na(myTesting)) < nrow(myTesting) * 0.6]
    projectTesting <- projectTesting[, colSums(is.na(projectTesting)) < nrow(projectTesting) * 0.6]
    # Discover which columns have near-zero variance
    myDataNZV <- nearZeroVar(myTraining, saveMetrics=TRUE)
    myDataNZV <- subset(myDataNZV, nzv==TRUE)
    removeColNames <- rownames(myDataNZV)
    # Remove the near-zero-variance columns from the training and testing
    # datasets, then keep only the sensor measurements and classe.
    # Logical indexing is used so that an empty removeColNames drops nothing.
    myTraining <- myTraining[, !(names(myTraining) %in% removeColNames)]
    myTraining <- myTraining[, 7:59]
    myTesting <- myTesting[, !(names(myTesting) %in% removeColNames)]
    myTesting <- myTesting[, 7:59]
    # The validation dataset keeps its near-zero-variance columns, so its
    # column range is shifted by one
    projectTesting <- projectTesting[, 8:60]

## Training the individual models

Trying four model types:

1. Boosting

        ## Accuracy
        ## 0.9678817

2. Random Forests

        ## Accuracy
        ## 0.994647

3. Bagging

        ## Accuracy
        ## 0.9887841

4. Trees

        ## Accuracy
        ## 0.4970686

## Results

Misclassification Error - by model type.

| Boosting  | Random Forest | Bagging   | Trees    |
|-----------|---------------|-----------|----------|
| 0.7288362 | 0.7265431     | 0.7273075 | 0.749156 |

    ##
    ## Call:
    ## summary.resamples(object = results)
    ##
    ## Models: rpart, gbm, bagging, rf
    ## Number of resamples: 28
    ##
    ## Accuracy
    ##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ## rpart   0.4726463 0.4948271 0.5038241 0.5093670 0.5121796 0.5862508    0
    ## gbm     0.9554140 0.9598918 0.9614652 0.9618720 0.9636827 0.9700637    0
    ## bagging 0.9789675 0.9858257 0.9888464 0.9879427 0.9904519 0.9936306    0
    ## rf      0.9904398 0.9929936 0.9942657 0.9939715 0.9950637 0.9968153    0
    ##
    ## Kappa
    ##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ## rpart   0.3099075 0.3390570 0.3523341 0.3616006 0.3629675 0.4801959    0
    ## gbm     0.9435532 0.9492439 0.9512649 0.9517601 0.9540435 0.9621205    0
    ## bagging 0.9733883 0.9820684 0.9858907 0.9847473 0.9879224 0.9919434    0
    ## rf      0.9879058 0.9911373 0.9927467 0.9923743 0.9937552 0.9959719    0

## Plot the models’ accuracy

[Figure: accuracy of the four models.]

From the results above, we choose Random Forests as the best predictor model. Its resampling results give an expected error rate of 0.57% (one minus the median resampled accuracy) and an estimated successful prediction rate of 99.43%.

We reject the null hypothesis that the data cannot be successfully re-classified. We have strong confidence that the recorded measurements can indeed be used to classify successful or unsuccessful execution of the exercise.

## Predicting against the validation dataset

Finally, we use the model to predict the “classe” values for the originally provided test data, which we reserved for final validation. The model's out-of-bag (OOB) estimate of the error rate is 0.51%, so our estimated accuracy is 99.49%.

    ##
    ## Call:
    ##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE,      verbose = FALSE)
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 27
    ##
    ##         OOB estimate of  error rate: 0.51%
    ## Confusion matrix:
    ##      A    B    C    D    E  class.error
    ## A 4460    2    1    0    1 0.0008960573
    ## B   17 3013    8    0    0 0.0082290981
    ## C    0    8 2718   12    0 0.0073046019
    ## D    0    1   17 2553    2 0.0077730276
    ## E    0    1    4    6 2875 0.0038115038

    ##  [1] B A B A A E D B A A B C B A E E A B B B
    ## Levels: A B C D E
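
The 0.51% error rate and the `class.error` column can be recomputed from the confusion matrix counts printed above. The sketch below retypes those counts by hand and assumes that rows are the actual classes and columns the predicted ones, which is the reading under which the row-wise error rates match the printed `class.error` values. The variable names are illustrative only.

```r
# Confusion matrix counts copied from the random forest output above.
# Assumption: rows = actual class, columns = predicted class.
classes <- c("A", "B", "C", "D", "E")
confusion <- matrix(
  c(4460,    2,    1,    0,    1,
      17, 3013,    8,    0,    0,
       0,    8, 2718,   12,    0,
       0,    1,   17, 2553,    2,
       0,    1,    4,    6, 2875),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

correct     <- diag(confusion)                     # correctly classified counts
class_error <- 1 - correct / rowSums(confusion)    # per-class error rate
oob_error   <- 1 - sum(correct) / sum(confusion)   # overall error rate

round(class_error, 10)
round(100 * oob_error, 2)   # overall error in percent
```

Of the 15,699 observations in the matrix, 80 fall off the diagonal, which gives the overall error of about 0.51% and the corresponding accuracy of about 99.49%.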