Topic 3 Overfitting
Learning Goals
- Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
- Implement testing and training sets in R using the tidymodels package
Slides from today are available here.
The tidymodels package

(If you have not already installed the tidymodels package, install it with install.packages("tidymodels").)
Over this course, we will be looking at a broad but linked set of specialized tools applied in statistical machine learning. Specialized tools generally require specialized code.
Each tool has been developed separately and coded in a unique way. In order to facilitate and streamline the user experience, there have been attempts at creating a uniform interface, such as the caret R package. The developers of caret are no longer maintaining that package; they are working on a newer package, called tidymodels.
In this class, we will use the tidymodels package, which uses the tidyverse syntax you learned in Stat 155. The tidymodels package is relatively new and continues to be developed as we speak. This means that I’m learning with you, and in a month or two there may be improved functionality.
As Prof. Heggeseth introduced in the R code videos, we have a general workflow structure that includes a model specification and a recipe (formula + preprocessing steps).
# Load the package
library(tidymodels)
tidymodels_prefer()

# Set the seed for the random number generator
set.seed(123)

# Specify Model
model_spec <-
  linear_reg() %>% # type of model
  set_engine(engine = ____) %>% # algorithm to fit the model
  set_args(__) %>% # hyperparameters/tuning parameters are needed for some models
  set_mode(__) # regression or classification

# Specify Recipe (if you have preprocessing steps)
rec <- recipe(formula, data) %>%
  step_{FILLIN}() %>% # e.g., step_filter() to subset the rows
  step_{FILLIN}() # e.g., step_lincomb() to remove all predictors which are perfect linear combinations of another

# Create Workflow (Model + Recipe)
model_wf <- workflow() %>%
  add_recipe(rec) %>% # or add_formula()
  add_model(model_spec)
We can fit that workflow to training data.
# Fit Model to training data (without a recipe)
fit_model <- fit(model_spec, formula, data_train)

# Fit Model & Recipe to training data
fit_model <- fit(model_wf, data_train)
And then we can evaluate that fit model on testing data (new data that has not been used to fit the model).
# Evaluate on testing data
model_output <- fit_model %>%
  predict(new_data = data_test) %>% # this function will apply the recipe to new_data and do the prediction
  bind_cols(data_test)

reg_metrics <- metric_set(rmse, rsq, mae)

model_output %>%
  reg_metrics(truth = __, estimate = .pred)
The power of tidymodels is that it allows us to streamline the vast world of machine learning techniques into one common syntax. Beyond "lm", there are many other machine learning methods that we can use.

In the exercises below, you’ll need to adapt the code above to fit a linear regression model (engine = "lm").
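To make the template concrete, here is a minimal sketch of the full workflow filled in for an ordinary linear regression with the "lm" engine. The data set (R's built-in mtcars), the formula, and the normalization step are illustrative choices of mine, not part of the exercises below.

# A minimal sketch of the template filled in for linear regression ("lm" engine).
# mtcars, the formula, and the normalization step are illustrative choices only.
library(tidymodels)
tidymodels_prefer()

lm_spec <- linear_reg() %>%      # type of model
  set_engine(engine = "lm") %>%  # algorithm to fit the model
  set_mode("regression")         # no extra set_args() needed for plain lm

example_rec <- recipe(mpg ~ hp + wt, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

example_wf <- workflow() %>%
  add_recipe(example_rec) %>%
  add_model(lm_spec)

example_fit <- fit(example_wf, data = mtcars)

example_metrics <- metric_set(rmse, rsq, mae)

example_fit %>%
  predict(new_data = mtcars) %>%  # predicting on the same data we trained on,
  bind_cols(mtcars) %>%           # so these are in-sample (training) metrics
  example_metrics(truth = mpg, estimate = .pred)

Note that because we predicted on the training data here, these metrics are exactly the kind of in-sample numbers this topic warns can give a misleading view of out-of-sample performance.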
Exercises
You can download a template RMarkdown file to start from here.
Context
We’ll be working with a dataset containing physical measurements on 80 adult males. These measurements include body fat percentage estimates as well as body circumference measurements.
- fatBrozek: Percent body fat using Brozek’s equation: 457/Density - 414.2
- fatSiri: Percent body fat using Siri’s equation: 495/Density - 450
- density: Density determined from underwater weighing (gm/cm^3)
- age: Age (years)
- weight: Weight (lbs)
- height: Height (inches)
- neck: Neck circumference (cm)
- chest: Chest circumference (cm)
- abdomen: Abdomen circumference (cm)
- hip: Hip circumference (cm)
- thigh: Thigh circumference (cm)
- knee: Knee circumference (cm)
- ankle: Ankle circumference (cm)
- biceps: Biceps (extended) circumference (cm)
- forearm: Forearm circumference (cm)
- wrist: Wrist circumference (cm)
It takes a lot of effort to estimate body fat percentage accurately through underwater weighing. The goal is to build the best predictive model for fatSiri using just circumference measurements, which are more easily attainable. (We won’t use fatBrozek or density as predictors because they’re other outcome variables.)
library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(tidymodels)
bodyfat_train <- read_csv("https://www.dropbox.com/s/js2gxnazybokbzh/bodyfat_train.csv?dl=1")

# Remove the fatBrozek, density, and hipin variables
bodyfat_train <- bodyfat_train %>%
  select(-fatBrozek, -density, -hipin)
Exercise 1: 5 models
Consider the 5 models below:
lm_spec <-
  linear_reg() %>%
  set_engine(engine = 'lm') %>%
  set_mode('regression')

mod1 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm,
            data = bodyfat_train)

mod2 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps,
            data = bodyfat_train)

mod3 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps+chest+hip,
            data = bodyfat_train)

mod4 <- fit(lm_spec,
            fatSiri ~ ., # The . means all predictors
            data = bodyfat_train)

bf_recipe <- recipe(fatSiri ~ ., data = bodyfat_train) %>%
  step_normalize(all_numeric_predictors())

bf_wf <- workflow() %>%
  add_recipe(bf_recipe) %>%
  add_model(lm_spec)

mod5 <- fit(bf_wf,
            data = bodyfat_train)
a. STAT 155 review: Look at the tidy() of mod1. Contextually interpret the coefficient for the weight predictor. Is anything surprising? Why might this be?
b. Explain how mod5 is different from mod4. You may want to look at bf_recipe %>% prep(bodyfat_train) %>% juice() to see the preprocessed training data.
c. Which model will have the lowest training RMSE, and why? Explain before calculating (that is part d).
d. Compute the training RMSE for models 1 through 5 to check your answer for part c. Write a sentence interpreting one of the values of RMSE in context. Below is a coding example to get you started.
reg_metrics <- metric_set(rmse)

mod1 %>%
  predict(new_data = bodyfat_train) %>%
  reg_metrics(truth = bodyfat_train$fatSiri, estimate = .pred)
e. Which model do you think is the “best”? You may calculate MAE and R squared as well to justify your answer. (An optional sketch for computing these metrics across all five models appears after this exercise.)
f. Which model do you think will perform worst on new test data? Why?
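If you want to avoid copy-pasting the metric code five times, here is one optional sketch (not required for the exercise) that loops over the five fitted models with purrr. It assumes mod1 through mod5 and bodyfat_train exist as created above; the list-plus-map approach is just one possible way to organize this.

# Optional sketch: compute training metrics for all five models in one table.
# Assumes mod1–mod5 and bodyfat_train exist as created above.
reg_metrics <- metric_set(rmse, rsq, mae)

model_list <- list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4, mod5 = mod5)

purrr::map_dfr(model_list, function(mod) {
  mod %>%
    predict(new_data = bodyfat_train) %>% # works for both parsnip fits and the workflow fit
    bind_cols(bodyfat_train) %>%
    reg_metrics(truth = fatSiri, estimate = .pred)
}, .id = "model")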
Exercise 2: Evaluating the Test Data
Now that you’ve thought about how well the models might perform on test data, deploy them in the real world by applying them to a new set of 172 adult males. You’ll need to update the new_data argument to use bodyfat_test instead of bodyfat_train.
bodyfat_test <- read_csv("https://www.dropbox.com/s/7gizws208u0oywq/bodyfat_test.csv?dl=1")
a. Calculate the test RMSE, MAE, and R squared for all five of the models. Here is some code to get you started.
# Use fit/trained models and evaluate on test data
reg_metrics <- metric_set(rmse, mae, rsq)

mod1 %>%
  predict(new_data = bodyfat_test) %>%
  reg_metrics(truth = bodyfat_test$fatSiri, estimate = .pred)
b. Look back to Exercise 1 and see which model you thought was “best” based on the training data. Is that the “best” model in terms of predicting on new data? Explain.
c. In “real life” we only have one data set. To get a sense of predictive performance on new test data, we could split our data into two groups. Discuss pros and cons of ways you might split the data. How big should the training set be? How big should the testing set be? (One possible splitting tool is sketched below.)
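If you want to experiment with part c, one common tool for making a single split is initial_split() from the rsample package (loaded with tidymodels). The 75%/25% proportion below is only an example to show the mechanics, not a recommendation, and here we simply re-split bodyfat_train for demonstration.

# One possible way to split a single data set into training and testing sets.
# The prop = 0.75 choice is illustrative, not a recommendation.
set.seed(2023) # arbitrary seed for reproducibility

bodyfat_split <- initial_split(bodyfat_train, prop = 0.75)

my_train <- training(bodyfat_split)
my_test  <- testing(bodyfat_split)

dim(my_train)
dim(my_test)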
Exercise 3: Overfitting
If you have time, consider the following relationship. Imagine a set of predictions that is overfit to this training data; you are not limited to lines. Draw a picture of that prediction function on a piece of paper.
set.seed(123)

data_train <- tibble::tibble(
  x = runif(15, 0, 7),
  y = x^2 + rnorm(15, sd = 7)
)

data_train %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  theme_classic()
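If you want to compare your drawing to something concrete, one way to visualize an overfit set of predictions is to let geom_smooth() fit a high-degree polynomial to these 15 points. The degree of 10 below is an arbitrary choice for illustration, not part of the exercise.

# One way to visualize overfitting: a degree-10 polynomial chases the noise
# in the 15 training points instead of the underlying x^2 trend.
data_train %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 10), se = FALSE) +
  theme_classic()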