In this report, I compare four models on how successfully they classify correctly and incorrectly executed exercises. The data is drawn from a previous research study, the details of which are here. In short, six participants performed an exercise in five different but predefined manners, classified from A (perfect execution) through B, C, D, and E, with each non-A class being “imperfect” in a different, discrete way.
The null hypothesis is that the recorded exercise-execution data is NOT sufficiently descriptive for the sensor recordings to inform a discriminant model that can classify the execution “classe” of each exercise observation after the fact.
library(caret)
library(gbm)
library(AppliedPredictiveModeling)
library(rpart)
library(doParallel)
#Find out how many cores are available (if you don't already know)
cores <- detectCores()
#Create cluster with desired number of cores; leave one core free for the
#machine's own processes
cl <- makeCluster(cores - 1)
#Register cluster
registerDoParallel(cl)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
set.seed(seed)
metric <- "Accuracy"
training <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
projectTesting <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
# Split the training data 80/20 into a model-building set and a validation
# holdout (model tuning itself uses the repeated 10-fold CV control above)
splitTrain <- createDataPartition(y=training$classe, p=0.8, list=FALSE)
myTraining <- training[splitTrain, ]
myTesting <- training[-splitTrain, ]
# Keep only columns that are less than 60% NA; derive the column set from
# myTraining and apply the same selection to myTesting so the sets stay aligned
keepCols <- colSums(is.na(myTraining)) < nrow(myTraining) * 0.6
myTraining <- myTraining[, keepCols]
myTesting <- myTesting[, keepCols]
projectTesting <- projectTesting[, colSums(is.na(projectTesting)) < nrow(projectTesting) * 0.6]
# Discover which columns have near-zero-variance
myDataNZV <- nearZeroVar(myTraining, saveMetrics=TRUE)
myDataNZV <- subset(myDataNZV,nzv==TRUE)
removeColNames <- rownames(myDataNZV)
# Remove the near-zero-variance columns from the training and testing datasets
myTraining <- myTraining[, -which(names(myTraining) %in% removeColNames)]
# Keep columns 7:59 - the sensor predictors plus classe - dropping the leading
# row-id, user-name, and timestamp bookkeeping columns
myTraining <- myTraining[, 7:59]
myTesting <- myTesting[, -which(names(myTesting) %in% removeColNames)]
myTesting <- myTesting[, 7:59]
# projectTesting still contains its near-zero-variance columns, so the same
# window of predictor columns sits one position later
projectTesting <- projectTesting[, 8:60]
We try four model types: trees (rpart), boosting (gbm), bagging (treebag), and random forest (rf). Each model's accuracy on the held-out validation set:
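The `train()` calls themselves do not appear in the report; a minimal sketch of how the four fits might look with caret, reusing the `control` and `metric` objects defined above (the fit object names are my own placeholders, and any option beyond what the output confirms is an assumption):

```r
# Fit the four candidate models with identical repeated-CV settings
# (object names fitRpart/fitGbm/fitBag/fitRf are placeholders)
set.seed(seed)
fitRpart <- train(classe ~ ., data = myTraining, method = "rpart",
                  metric = metric, trControl = control)
fitGbm   <- train(classe ~ ., data = myTraining, method = "gbm",
                  metric = metric, trControl = control, verbose = FALSE)
fitBag   <- train(classe ~ ., data = myTraining, method = "treebag",
                  metric = metric, trControl = control)
fitRf    <- train(classe ~ ., data = myTraining, method = "rf",
                  metric = metric, trControl = control, importance = TRUE)
```

With `doParallel` registered, caret distributes the resampling folds across the worker processes automatically.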
Model | Accuracy |
---|---|
Boosting (gbm) | 0.9678817 |
Random Forest (rf) | 0.994647 |
Bagging (treebag) | 0.9887841 |
Trees (rpart) | 0.4970686 |
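These accuracies would typically be computed on the 20% validation holdout; a sketch using caret's `confusionMatrix` (the fit object names are placeholders, not shown in the report):

```r
# Accuracy of each fitted model on the held-out validation set
fits <- list(gbm = fitGbm, rf = fitRf, bagging = fitBag, rpart = fitRpart)
sapply(fits, function(fit) {
  pred <- predict(fit, newdata = myTesting)
  confusionMatrix(pred, myTesting$classe)$overall["Accuracy"]
})
```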
Boosting | Random Forest | Bagging | Trees |
---|---|---|---|
0.7288362 | 0.7265431 | 0.7273075 | 0.749156 |
##
## Call:
## summary.resamples(object = results)
##
## Models: rpart, gbm, bagging, rf
## Number of resamples: 28
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.4726463 0.4948271 0.5038241 0.5093670 0.5121796 0.5862508 0
## gbm 0.9554140 0.9598918 0.9614652 0.9618720 0.9636827 0.9700637 0
## bagging 0.9789675 0.9858257 0.9888464 0.9879427 0.9904519 0.9936306 0
## rf 0.9904398 0.9929936 0.9942657 0.9939715 0.9950637 0.9968153 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.3099075 0.3390570 0.3523341 0.3616006 0.3629675 0.4801959 0
## gbm 0.9435532 0.9492439 0.9512649 0.9517601 0.9540435 0.9621205 0
## bagging 0.9733883 0.9820684 0.9858907 0.9847473 0.9879224 0.9919434 0
## rf 0.9879058 0.9911373 0.9927467 0.9923743 0.9937552 0.9959719 0
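The summary above comes from caret's `resamples()`; a sketch of how the comparison would be produced (again with assumed fit object names):

```r
# Pool the cross-validation resamples of all four models and summarise
results <- resamples(list(rpart = fitRpart, gbm = fitGbm,
                          bagging = fitBag, rf = fitRf))
summary(results)
# Lattice dotplot of accuracy and kappa across the four models
dotplot(results)
```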
From the results above, we choose the random forest as the best predictor model. Its accuracy of 99.46% on the validation holdout corresponds to an estimated error rate of roughly 0.54%. We reject the null hypothesis that the data cannot be successfully re-classified: we have strong confidence that the recorded measurements can indeed be used to classify successful or unsuccessful execution of the exercise.
Finally, we use the model to predict the “classe” values for the originally provided test data, which we reserved for final validation. The random forest's out-of-bag (OOB) estimate of the error rate is 0.51%, i.e. an estimated accuracy of 99.49%.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.51%
## Confusion matrix:
## A B C D E class.error
## A 4460 2 1 0 1 0.0008960573
## B 17 3013 8 0 0 0.0082290981
## C 0 8 2718 12 0 0.0073046019
## D 0 1 17 2553 2 0.0077730276
## E 0 1 4 6 2875 0.0038115038
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
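The prediction output above could be produced as follows, after which the parallel cluster registered at the top of the report should be released (again assuming the `fitRf` object name):

```r
# Predict classe for the 20 reserved test cases
predict(fitRf, newdata = projectTesting)
# Shut down the parallel workers started earlier
stopCluster(cl)
```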