Topic 2 Evaluating Regression Models

Learning Goals

Create and interpret residuals vs. fitted, residuals vs. predictor plots to identify improvements in modeling and address ethical concerns
Interpret MSE, RMSE, MAE, and R-squared in a contextually meaningful way

We’ll also start to develop some new ideas relating to our next topics.

Slides from today are available here.

Exercises

You can download a template RMarkdown file to start from here.

Context

We’ll be working with a dataset containing physical measurements on 80 adult males. These measurements include body fat percentage estimates as well as body circumference measurements.

fatBrozek: Percent body fat using Brozek’s equation: 457/Density - 414.2
fatSiri: Percent body fat using Siri’s equation: 495/Density - 450
density: Density determined from underwater weighing (gm/cm^3).
age: Age (years)
weight: Weight (lbs)
height: Height (inches)
neck: Neck circumference (cm)
chest: Chest circumference (cm)
abdomen: Abdomen circumference (cm)
hip: Hip circumference (cm)
thigh: Thigh circumference (cm)
knee: Knee circumference (cm)
ankle: Ankle circumference (cm)
biceps: Biceps (extended) circumference (cm)
forearm: Forearm circumference (cm)
wrist: Wrist circumference (cm)

It takes a lot of effort to estimate body fat percentage accurately through underwater weighing. The goal is to build the best predictive model for fatSiri using just circumference measurements, which are more easily attainable. (We won’t use fatBrozek or density as predictors because they’re other outcome variables.)

library(readr)
library(ggplot2)
library(dplyr)
library(broom)
library(tidymodels)
tidymodels_prefer()

bodyfat <- read_csv("https://www.dropbox.com/s/js2gxnazybokbzh/bodyfat_train.csv?dl=1")

# Remove the fatBrozek and density variables
bodyfat <- bodyfat %>%
    select(-fatBrozek, -density)

Class investigations

We’ll work through this section together to review concepts and code. You’ll then work on the remainder of the exercises in your groups.

# Exploratory plots
# Univariate distribution of outcome
ggplot(bodyfat, aes(???)) +
    geom_???()

# Scatterplot of fatSiri vs. weight
ggplot(bodyfat, aes(???)) +
    geom_???()

Let’s fit a linear regression model to predict body fat percentage from weight.

mod1 <- lm(??? ~ ???, data = bodyfat)

We can use the augment() function from the broom package to augment our original data with useful information from the model fitting. In particular, residuals are stored in the .fitted column, and fitted (predicted) values for the cases supplied in newdata are stored in the .fitted column.

mod1_output <- broom::augment(mod1, newdata = bodyfat)
head(mod1_output)

We can use the augment() output to compute error metrics, and glance() to obtain R^2:

mod1 %>%
  augment() %>%
  summarize(
    mse = mean((fatSiri - .fitted)^2),
    rmse = sqrt(mse),
    mae = mean(abs(fatSiri - .fitted))
  )

# R-squared - interpretation? (unit-less)
mod1 %>%
  glance() %>%
  select(r.squared)

…and to create residual plots:

# Univariate plot of residuals
mod1 %>%
  augment() %>%
  ggplot(aes(x = ???)) +
  geom_histogram() +
  theme_minimal()

# Fitted vs. Residual plot
mod1 %>%
  augment() %>%
  ggplot(aes(x = ???, y = ???)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0) +
  theme_minimal()

# Predictor vs. Residual plot
mod1 %>% 
  augment() %>%
  left_join(bodyfat) %>% #Merge the remaining variables into data set
  ggplot(aes(x = height, y = ???)) + #note patterns in residual/error with height
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 0) +
  theme_minimal()

Exercise 1

First decide on what you think would be a good model by picking variables based on context. Fit this model, calling it mod_initial. (Remember that you can include several predictors with a + in the model formula - like y ~ x1+x2.)

# Code to fit initial model

Use residual plot explorations to check if you need to update your model.

# Residual plot explorations

Fit your updated model, and call it model_updated.

# Code to fit updated model

Exercise 2

Compute and contextually interpret relevant evaluation metrics for your model.

# Code to compute evaluation metrics

Exercise 3

Now that you’ve selected your best model, deploy it in the real world by applying it to a new set of 172 adult males. You’ll need to update the newdata argument of the broom::augment() code to use bodyfat_test instead of bodyfat.

bodyfat_test <- read_csv("https://www.dropbox.com/s/7gizws208u0oywq/bodyfat_test.csv?dl=1")

Compare your evaluation metrics from Exercise 2 the metrics here. What do you notice? (Note: this observation is just based on your one particular fitted model. You’ll make some more comprehensive observations in the next exercise.)
In general, do you think that models with more or less variables would perform better in this comparison? Explain.

Exercise 4

The code below systematically looks at the same comparison that you made in Exercise 3 but for every possible linear regression model formed from inclusion/exclusion of the predictors (without transformations or interactions).

Run the code to make a plot of the results of this systematic investigation. (Feel free to inspect the code if you’re curious, but otherwise, don’t worry about understanding it fully.)

What do you notice? What do you wonder?

get_maes <- function(mod, train_data, test_data) {
    mod_output_train <- broom::augment(mod, newdata = train_data)
    mod_output_test <- broom::augment(mod, newdata = test_data)
    train_mae <- mean(abs(mod_output_train$.resid))
    test_mae <- mean(abs(mod_output_test$.resid))
    c(train_mae, test_mae)
}


possible_predictors <- setdiff(colnames(bodyfat), c("fatSiri", "hipin"))
results <- bind_rows(lapply(1:13, function(i) {
    combos <- combn(possible_predictors, i)
    bind_rows(lapply(seq_len(ncol(combos)), function(j) {
        formula <- paste("fatSiri ~", paste(combos[,j], collapse = "+"))
        mod <- lm(as.formula(formula), data = bodyfat)
        maes <- get_maes(mod = mod, train_data = bodyfat, test_data = bodyfat_test)
        tibble(
            form = formula,
            train_mae = maes[1],
            test_mae = maes[2],
            num_predictors = i
        )
    }))
}))

# Relabel the num_predictors variable
results <- results %>%
    mutate(num_predictors = paste("# predictors:", num_predictors)) %>%
    mutate(num_predictors = factor(num_predictors, levels = paste("# predictors:", 1:13)))

# Plot results
ggplot(results, aes(x = train_mae, y = test_mae, color = num_predictors)) + 
    geom_point() + 
    coord_cartesian(xlim = c(0,7.5), ylim = c(0,7.5)) + 
    geom_abline(slope = 1, intercept = 0) + 
    facet_wrap(~num_predictors) +
    guides(color = FALSE) +
    labs(x = "MAE on original data", y = "MAE on new data on 172 men") +
    theme_classic()