Topic 4: Cross-validation

Learning Goals

  • Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
  • Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
  • Explain what role CV has in a predictive modeling analysis and its connection to overfitting
  • Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
  • Implement cross-validation in R using the tidymodels package


Slides from today are available here.

Exercises

You can download a template RMarkdown file to start from here.

Context

We’ll be working with a dataset containing physical measurements on 80 adult males. These measurements include body fat percentage estimates as well as body circumference measurements.

  • fatBrozek: Percent body fat using Brozek’s equation: 457/Density - 414.2
  • fatSiri: Percent body fat using Siri’s equation: 495/Density - 450
  • density: Density determined from underwater weighing (g/cm^3)
  • age: Age (years)
  • weight: Weight (lbs)
  • height: Height (inches)
  • neck: Neck circumference (cm)
  • chest: Chest circumference (cm)
  • abdomen: Abdomen circumference (cm)
  • hip: Hip circumference (cm)
  • thigh: Thigh circumference (cm)
  • knee: Knee circumference (cm)
  • ankle: Ankle circumference (cm)
  • biceps: Biceps (extended) circumference (cm)
  • forearm: Forearm circumference (cm)
  • wrist: Wrist circumference (cm)

It takes a lot of effort to estimate body fat percentage accurately through underwater weighing. The goal is to build the best predictive model for fatSiri using just circumference measurements, which are much easier to obtain. (We won’t use fatBrozek or density as predictors because they are essentially alternative versions of the outcome.)

library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(tidymodels)
bodyfat_train <- read_csv("https://www.dropbox.com/s/js2gxnazybokbzh/bodyfat_train.csv?dl=1")

# Remove the fatBrozek, density, and hipin variables
bodyfat_train <- bodyfat_train %>%
    select(-fatBrozek, -density, -hipin)
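
A quick sanity check on the import (optional; the expected dimensions come from the codebook above):

# Should show 80 rows and 14 columns: fatSiri plus 13 possible predictors
dim(bodyfat_train)
head(bodyfat_train)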

Consider the first four models we built on Thursday:

lm_spec <-
    linear_reg() %>% 
    set_engine(engine = 'lm') %>% 
    set_mode('regression')

mod1 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm, 
            data = bodyfat_train)

mod2 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps, 
            data = bodyfat_train)

mod3 <- fit(lm_spec,
            fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps+chest+hip, 
            data = bodyfat_train)

mod4 <- fit(lm_spec,
            fatSiri ~ .,  # The . means all predictors
            data = bodyfat_train) 
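
For comparison with the CV results later, the training (in-sample) RMSE of any of these fits can be computed with yardstick; a sketch for mod1 (the same pattern works for mod2 through mod4):

# Training RMSE: predict on the SAME data used to fit the model
mod1 %>%
    augment(new_data = bodyfat_train) %>%
    rmse(truth = fatSiri, estimate = .pred)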

Exercise 1: Cross-validation in Concept

We are going to repeat what we did last week, but this time use cross-validation to evaluate the models’ predictive performance.

Explain to your table-mates the steps of cross-validation (CV) in concept, and then describe how you might carry out 10-fold CV with these 80 individual data points.
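
If it helps to make the steps concrete, here is a “by hand” sketch of 10-fold CV for the mod1 formula using base lm() (the tidymodels code in Exercise 2 automates all of this):

# Step 1: randomly split the 80 cases into 10 folds of 8
set.seed(2023)
fold_id <- sample(rep(1:10, length.out = nrow(bodyfat_train)))

# Step 2: for each fold, fit on the other 9 folds, then compute RMSE on the held-out fold
fold_rmse <- sapply(1:10, function(k) {
    fit_k <- lm(fatSiri ~ age + weight + neck + abdomen + thigh + forearm,
                data = bodyfat_train[fold_id != k, ])
    held_out <- bodyfat_train[fold_id == k, ]
    sqrt(mean((held_out$fatSiri - predict(fit_k, newdata = held_out))^2))
})

# Step 3: average the 10 fold-specific RMSEs to estimate the test RMSE
mean(fold_rmse)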

Exercise 2: Cross-validation with tidymodels

  1. Complete the code below to perform 10-fold cross-validation for mod1 to estimate the test RMSE (\(\text{CV}_{(10)}\)). Do we need to use set.seed()? Why or why not? (Is there a number of folds for which we would not need to set the seed?)

    # Do we need to use set.seed()?
    set.seed(2023)
    bodyfat_cv <- vfold_cv(??, v = 10)
    
    model_wf <- workflow() %>%
      add_formula(??) %>%
      add_model(lm_spec)
    
    mod1_cv <- fit_resamples(model_wf,
      resamples = bodyfat_cv, 
      metrics = metric_set(rmse, rsq, mae)
    )
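
    If you get stuck, one possible way to fill the blanks (using the training data and mod1’s formula from above; treat this as a check, not the only answer):

    set.seed(2023)
    bodyfat_cv <- vfold_cv(bodyfat_train, v = 10)
    
    model_wf <- workflow() %>%
      add_formula(fatSiri ~ age+weight+neck+abdomen+thigh+forearm) %>%
      add_model(lm_spec)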
  2. Run the code below, and use its output to calculate the 10-fold cross-validated RMSE “by hand” (you can use R to do the arithmetic, but apply the averaging formula yourself rather than letting tidymodels do it).

mod1_cv %>% unnest(.metrics)
  3. Run the code below, and compare your answer to part 2.
mod1_cv %>% collect_metrics()
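
Note: for each metric, collect_metrics() averages the 10 fold-specific estimates (and reports a standard error across folds), so for RMSE it computes

\[
\text{CV}_{(10)} = \frac{1}{10}\sum_{i=1}^{10} \text{RMSE}_i
\]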

Exercise 3: Looking at the evaluation metrics

Perform 10-fold CV using mod2, mod3, and mod4 by running the code below:

model2_wf <- workflow() %>%
  add_formula(fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps) %>%
  add_model(lm_spec)

mod2_cv <- fit_resamples(model2_wf,
  resamples = bodyfat_cv, 
  metrics = metric_set(rmse, rsq, mae)
)

model3_wf <- workflow() %>%
  add_formula(fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps+chest+hip) %>%
  add_model(lm_spec)

mod3_cv <- fit_resamples(model3_wf,
  resamples = bodyfat_cv, 
  metrics = metric_set(rmse, rsq, mae)
)

model4_wf <- workflow() %>%
  add_formula(fatSiri ~ .) %>%
  add_model(lm_spec)

mod4_cv <- fit_resamples(model4_wf,
  resamples = bodyfat_cv, 
  metrics = metric_set(rmse, rsq, mae)
)

mod1_cv %>% collect_metrics() %>% filter(.metric == "rmse")
mod2_cv %>% collect_metrics() %>% filter(.metric == "rmse")
mod3_cv %>% collect_metrics() %>% filter(.metric == "rmse")
mod4_cv %>% collect_metrics() %>% filter(.metric == "rmse")
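
If you’d like to see all four estimates in one table, here is one way to stack them (a sketch using the objects above):

# Combine the CV metrics from the four models and keep the RMSE rows
bind_rows(
    mod1_cv %>% collect_metrics() %>% mutate(model = "mod1"),
    mod2_cv %>% collect_metrics() %>% mutate(model = "mod2"),
    mod3_cv %>% collect_metrics() %>% mutate(model = "mod3"),
    mod4_cv %>% collect_metrics() %>% mutate(model = "mod4")
) %>%
    filter(.metric == "rmse")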

Look at the completed table below of evaluation metrics for the 4 models.

Model   Training RMSE   \(\text{CV}_{(10)}\)
-----   -------------   ---------------------
mod1    3.811           4.193
mod2    3.767           4.305
mod3    3.752           4.368
mod4    3.572           4.438
  1. Which model performed the best on the training data?
  2. Which model is estimated to perform best on new test cases, i.e., which has the lowest \(\text{CV}_{(10)}\)?
  3. Explain why there’s a discrepancy between these two answers and why CV, in general, can help reduce the impact of overfitting.

Exercise 4: Practical issues: choosing \(k\)

  1. In terms of sample size, what are the pros/cons of low vs. high \(k\)?
  2. In terms of computational time, what are the pros/cons of low vs. high \(k\)?
  3. If possible, it is advisable to choose \(k\) to be a divisor of the sample size. Why do you think that is? (The sketch below shows what fold sizes look like when it is not.)
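
For part 3, it can help to see what vfold_cv() does when \(v\) does not divide \(n\). A sketch with \(n = 80\) and \(v = 7\):

# Assessment-set sizes are as equal as possible, but not all equal
# (here, a mix of 12s and 11s)
set.seed(2023)
folds7 <- vfold_cv(bodyfat_train, v = 7)
purrr::map_int(folds7$splits, ~ nrow(assessment(.x)))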

Digging deeper

If you have time, consider these exercises to further explore concepts related to today’s ideas, this time using leave-one-out cross-validation (LOOCV):

  • Would two different seeds (set via set.seed()) make a difference in the results? Why or why not?
  • Using the information from your_output %>% unnest(.metrics), construct a visualization to examine the variability of RMSE from case to case (a starter sketch follows after this list). What might explain any large values? What does this highlight about the quality of estimation with LOOCV?
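
A starter sketch for the LOOCV questions, assuming you have completed model_wf from Exercise 2. Here LOOCV is run as \(v\)-fold CV with \(v = n\); rsq is omitted because R-squared is undefined on a single held-out case:

# LOOCV: each of the 80 folds holds out exactly one case
loocv_folds <- vfold_cv(bodyfat_train, v = nrow(bodyfat_train))

mod1_loocv <- fit_resamples(model_wf,
    resamples = loocv_folds,
    metrics = metric_set(rmse, mae)
)

# Each fold's RMSE is just |observed - predicted| for its single held-out case
mod1_loocv %>%
    unnest(.metrics) %>%
    filter(.metric == "rmse") %>%
    ggplot(aes(x = .estimate)) +
    geom_histogram()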