Homework 5

Due Friday, April 28 at 11:59pm CST on Moodle

Project Work

Look at the final requirements on the Final Project page. There is nothing you must submit for this homework assignment, but you should start on the following:

Have one working regression model, with reasons why you chose that model.
Explore 2 different classification models to answer your classification research question
Consolidate code into one final Rmd file.

Ethics in ML (NOT REQUIRED)

This section is not required for this last HW. Take time for yourself. (Everyone will receive a free Meets Expectations on all areas for this assignment.)

In case it’s of interest and you have the bandwidth for it, the article “Race,” Racism, and the Practice of Epidemiology is a very illuminating one.

Portfolio Work

Revisions:

Make any revisions desired to previous concepts.
- Important formatting note: Please use a comment to mark the text that you want to be reread. (Highlight each span of text you want to be reread, and mark it with the comment “REVISION”.)

New concepts to address:

K-Means Clustering:

Algorithmic understanding: Perform two iterations of k-means with k = 2 on the dataset below. The data has just 1 variable x1, and the random initial cluster assignment is shown in the cluster column. Show your work: in particular, show the centroids computed for iterations 1 and 2 and the updated cluster assignments for iterations 1 and 2.

 x1   cluster
---- ---------
  1      1
  1      2
  3      2
  4      1
  5      1

Bias-variance tradeoff: (This is a prompt about clustering in general, but put your response in the K-Means section.) In clustering, we don’t quite have the same concepts of bias and variance as we do with supervised learning methods, but a similar type of tradeoff exists. Discuss the pros and cons of high vs. low number of clusters in terms of (1) ease of learning more about each cluster and (2) within-cluster homogeneity (closeness of cases within clusters). (5 sentences max.)
Scaling of variables: (This is a prompt that pertains to both k-means and hierarchical clustering, but put your response in the K-Means section.) Does the scale on which variables are measured matter for the performance of clustering? Why or why not? If scale does matter, how should this be addressed? (3 sentences max.)
Computational time: Consider a single round of the cluster reassignment step of k-means with \(n\) cases and \(k\) clusters. How many distance calculations are required in this step? Explain in at most 2 sentences.
Interpretation of output: (This is a prompt about clustering in general, but put your response in the K-Means section.) Describe data explorations we could use to interpret / learn more about the cluster assignments that clustering algorithms produce.

Hierarchical clustering:

Algorithmic understanding: We have a dataset with 4 cases, and the Euclidean distance between every pair of cases is shown below. (The column labeled 1 gives the distances of case 1 to cases 2, 3, and 4 from top to bottom.) Draw the dendrogram that would result from single-linkage clustering. Clearly label what cases are at each leaf and the heights at which fusions occur. Show any intermediate work.

     |   1       2       3
-----|------- ------- -------
  2  |  0.69       
  3  |  1.23    0.55 
  4  |  0.94    1.39    1.75

Bias-variance tradeoff: How does the tree cutting height relate to the tradeoff you discussed in the K-Means section? (2 sentences max.)
Computational time: Consider the very first step of hierarchical clustering in which all \(n\) cases are in their own cluster. How many distance calculations are required as a function of \(n\)? (Note: \(1 + 2 + \cdots + n = n(n+1)/2\).) Explain in at most 2 sentences.

Principal components analysis:

Algorithmic understanding: In no more than 4 sentences, summarize the goal of principal components analysis and how it allows us to perform dimension reduction. Use the following terms / ideas in your response: linear combination, variance.
Bias-variance tradeoff: How is dimension reduction related to the bias-variance tradeoff for some of the supervised methods we’ve covered? How is the use of the scree plot from PCA related to the tradeoff?
Scaling of variables: Does the scale on which variables are measured matter for the performance of this algorithm? Why or why not? If scale does matter, how should this be addressed when using this method? (3 sentences max.)
Interpretation of output: What information can we gain by looking at the loadings of the first few principal components? Explain in at most 3 sentences.