Topic 7 KNN Regression and the Bias-Variance Tradeoff
Learning Goals
- Clearly describe / implement by hand the KNN algorithm for making a regression prediction
- Explain how the number of neighbors relates to the bias-variance tradeoff
- Explain the difference between parametric and nonparametric methods
- Explain how the curse of dimensionality relates to the performance of KNN
Slides from today are available here.
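To make the first learning goal concrete, here is a minimal by-hand sketch (with made-up toy data) of a KNN regression prediction in base R:

# Toy training data (made up for illustration): one predictor x, quantitative outcome y
train_x <- c(1, 2, 4, 7, 9)
train_y <- c(10, 12, 20, 28, 35)

x_new <- 5 # test case we want a prediction for
k <- 3 # number of neighbors

dist_to_new <- abs(train_x - x_new) # 1. distance from the test case to every training case
nearest_k <- order(dist_to_new)[1:k] # 2. indices of the k closest training cases
mean(train_y[nearest_k]) # 3. prediction = average outcome among those k neighbors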
KNN models in tidymodels
To build KNN models in tidymodels, first load the needed packages and set the seed for the random number generator to ensure reproducible results:
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions
set.seed(2023) # or choose your favorite number!
Then adapt the following code:
# CV Folds
data_cv10 <- vfold_cv(___, v = 10)

# Model Specification
knn_spec <-
  nearest_neighbor() %>% # new type of model!
  set_args(neighbors = tune()) %>% # tuning parameter is neighbors; tuning spec
  set_engine(engine = 'kknn') %>% # new engine
  set_mode('regression')

# Recipe with standardization (!)
data_rec <- recipe( ___ ~ ___ , data = ___) %>%
  step_nzv(all_predictors()) %>% # removes predictors with near-zero variance (e.g., (almost) all the same value)
  step_novel(all_nominal_predictors()) %>% # important if you have rare categorical variables
  step_normalize(all_numeric_predictors()) %>% # important standardization step for KNN
  step_dummy(all_nominal_predictors()) # creates indicator variables for categorical variables (important for KNN!)

# Workflow (Recipe + Model)
knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(data_rec)

# Tune model trying a variety of values for neighbors (using 10-fold CV)
neighbor_grid <- grid_regular(
  neighbors(range = c(1, 50)), # min and max values for neighbors
  levels = 50) # number of neighbors values to try

knn_fit_cv <- tune_grid(knn_wf, # workflow
  resamples = data_cv10, # CV folds
  grid = neighbor_grid, # grid specified above
  metrics = metric_set(rmse, mae))
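As a concrete (hypothetical) illustration of how the blanks above might be filled in, where my_data and the outcome y are placeholder names rather than anything from these notes:

data_cv10 <- vfold_cv(my_data, v = 10) # my_data is a placeholder dataset name
data_rec <- recipe(y ~ ., data = my_data) %>% # y is a placeholder outcome name
  step_nzv(all_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())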
Argument | Meaning
---|---
y ~ . | Model formula for specifying the response and predictors
data | Sample data
preProcess | "scale" indicates that predictor variables should be scaled to have the same variance (Why might this be important? Should we always do this?)
method | "knn" implements KNN regression (and classification)
tuneGrid | A mini-dataset (data.frame) of tuning parameters. \(k\) is the KNN neighborhood size. Supply a sequence as seq(begin, end, by = size of step).
trControl | Use cross-validation to estimate test performance for each model fit. The process used to pick a final model from among these is indicated by selectionFunction, with options including "best" and "oneSE".
metric | Evaluate and compare competing models with respect to their CV-MAE.
na.action | Set na.action = na.omit to prevent errors if the data has missing values.
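For reference, the tidymodels analogue of tuneGrid is the grid argument of tune_grid(): besides grid_regular(), you can pass a data frame of candidate tuning values directly. A minimal sketch (the particular sequence shown is just an example):

neighbor_grid_manual <- tibble(neighbors = seq(1, 100, by = 5)) # one column per tuning parameter
tune_grid(knn_wf, resamples = data_cv10, grid = neighbor_grid_manual, metrics = metric_set(rmse, mae))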
Note: When including categorical predictors, tidymodels automatically creates the corresponding indicator variables to allow Euclidean distance to still be used. You’ll think about the pros/cons of this in the exercises.
An alternative to creating indicator variables is to use Gower distance. We’ll explore Gower distance more when we talk about clustering.
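As a small illustration of what that indicator-variable encoding looks like (toy data, not from these notes):

toy <- tibble(
  Grad.Rate = c(60, 75, 90),
  Private = factor(c("Yes", "No", "Yes")),
  Apps = c(1000, 5000, 13000))

recipe(Grad.Rate ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL) # Private becomes a 0/1 indicator column (Private_Yes)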
Identifying the “best” KNN model
The “best” model in the sequence of models fit is defined relative to the chosen selectionFunction and metric.
knn_fit_cv %>% autoplot() # Visualize Trained Model using CV

knn_fit_cv %>% show_best(metric = 'mae') # Show evaluation metrics for different values of neighbors, ordered

# Choose value of Tuning Parameter (neighbors)
tuned_knn_wf <- knn_fit_cv %>%
  select_by_one_std_err(metric = 'mae', desc(neighbors)) %>% # Choose the largest neighbors value whose CV MAE is within 1 SE of the lowest CV MAE
  finalize_workflow(knn_wf, .)

# Fit final KNN model to data
knn_fit_final <- tuned_knn_wf %>%
  fit(data = ___)

# Use the best model to make predictions
# new_data should be a data.frame with required predictors
predict(knn_fit_final, new_data = ___)
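For example (a hypothetical sketch, where x1 and x2 stand in for whatever predictors your recipe formula uses):

new_obs <- tibble(x1 = 2.5, x2 = 10) # hypothetical predictor values
predict(knn_fit_final, new_data = new_obs) # returns a tibble with a .pred column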
Exercises
You can download a template RMarkdown file to start from here.
We’ll explore KNN regression using the College dataset in the ISLR package (install it with install.packages("ISLR") in the Console). You can use ?College in the Console to look at the data codebook.
library(ISLR)
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions
data(College)
# Data cleaning
college_clean <- College %>%
  mutate(school = rownames(College)) %>% # creates variable with school name
  filter(Grad.Rate <= 100) # Remove one school with grad rate of 118%

rownames(college_clean) <- NULL # Remove school names as row names
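If you want to peek at the cleaned data (for example, to think about the sample size alluded to in Exercise 3), something like:

dim(college_clean) # number of schools (rows) and variables (columns)
head(college_clean)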
Exercise 1: Bias-variance tradeoff warmup
- Think back to the LASSO algorithm, which depends upon the tuning parameter \(\lambda\).
- For which values of \(\lambda\) (small or large) will LASSO be the most biased, and why?
- For which values of \(\lambda\) (small or large) will LASSO be the most variable, and why?
- The bias-variance tradeoff also comes into play when comparing across algorithms, not just within algorithms. Consider LASSO vs. least squares:
- Which will tend to be more biased?
- Which will tend to be more variable?
- When will LASSO outperform least squares in the bias-variance tradeoff?
Exercise 2: Impact of distance metric
Consider the 1-nearest-neighbor algorithm to predict Grad.Rate on the basis of two predictors: Apps and Private. Let Yes for Private be represented with the value 1 and No with 0.

We have a test case: a private school that received 13,530 applications. Suppose that we have the tiny 2-case training set below. What would the 1-nearest-neighbor prediction be using Euclidean distance?
college_clean %>%
  filter(school %in% c("Princeton University", "SUNY at Albany")) %>%
  select(Apps, Private, Grad.Rate, school)
Do you have any concerns about the resulting prediction? Based on this, comment on the impact of the distance metric chosen on KNN performance. How might you change the distance calculation (or correspondingly rescale the data) to generate a more sensible prediction in this situation?
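If it helps to experiment, here is a generic sketch (not the exercise answer; all numbers below are made-up placeholders) of computing Euclidean distances by hand with and without rescaling the predictors:

# Toy training cases and a test case on the (Apps, Private) scale -- made-up numbers
train <- tibble(Apps = c(100, 14000), Private = c(1, 0))
test <- tibble(Apps = 13530, Private = 1)

# Raw-scale Euclidean distance: Apps dominates because its scale is so much larger
sqrt((train$Apps - test$Apps)^2 + (train$Private - test$Private)^2)

# After standardizing each predictor, both contribute comparably to the distance
apps_z <- (c(train$Apps, test$Apps) - mean(train$Apps)) / sd(train$Apps)
priv_z <- (c(train$Private, test$Private) - mean(train$Private)) / sd(train$Private)
sqrt((apps_z[1:2] - apps_z[3])^2 + (priv_z[1:2] - priv_z[3])^2)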
Exercise 3: Implementing KNN in tidymodels
Adapt our general KNN code to “fit” a set of KNN models with the following specifications:
- Use the predictors Private, Top10perc (% of new students from the top 10% of their high school class), and S.F.Ratio (student/faculty ratio).
- Predict values of Grad.Rate.
- Use 8-fold CV. (Why 8? Take a look at the sample size.)
- Use mean absolute error (MAE) to select a final model.
- Select the simplest model for which the metric is within one standard error of the best metric.
- Use a sequence of \(K\) values from 1 to 100 in increments of 5.
- Should you use preProcess = "scale"?
After adapting the code (but before inspecting any output), answer the following conceptual questions:
- Explain your choice for using or not using preProcess = "scale".
- Why is “fit” in quotes? Does KNN actually fit a model as part of training? (This feature of KNN is known as “lazy learning”.)
- How is test MAE estimated? What are the steps of the KNN algorithm with cross-validation?
- Draw a picture of how you expect test MAE to vary with \(K\). In terms of the bias-variance tradeoff, why do you expect the plot to look this way?
set.seed(2023)

data_cv8 <- vfold_cv(____, v = ____)

# Model Specification
knn_spec <-
  nearest_neighbor() %>% # new type of model!
  set_args(neighbors = tune()) %>% # tuning parameter is neighbors; tuning spec
  set_engine(engine = 'kknn') %>% # new engine
  set_mode('regression')

# Recipe with standardization (!)
data_rec <- recipe( ___ ~ ___ , data = college_clean) %>% # recipe() needs the original data frame, not the CV folds
  step_nzv(all_predictors()) %>% # removes predictors with near-zero variance (e.g., (almost) all the same value)
  step_novel(all_nominal_predictors()) %>% # important if you have rare categorical variables
  step_normalize(all_numeric_predictors()) %>% # important standardization step for KNN
  step_dummy(all_nominal_predictors()) # creates indicator variables for categorical variables (important for KNN!)

# Workflow (Recipe + Model)
knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(data_rec)

# Tune model trying a variety of values for neighbors (using 8-fold CV)
neighbor_grid <- grid_regular(
  neighbors(range = c(______, _____)), # min and max values for neighbors
  levels = _______) # number of neighbors values to try

knn_fit_cv <- tune_grid(knn_wf, # workflow
  resamples = data_cv8, # CV folds
  grid = neighbor_grid, # grid specified above
  metrics = metric_set(rmse, mae))