Homework 1

**Submit by Sunday, Feb. 5th at 11:59pm to Moodle.**

Please turn in a single PDF document containing (1) your responses for the Project Work and Ethics in ML sections and (2) a LINK to the Google Doc with your responses for the Portfolio Work section.

Project Work

Goal: Find a dataset (or datasets) to use for your final project, and start to get to know the data.

Details:

Your dataset(s) should allow you to perform (1) a regression analysis, (2) a classification analysis, and (3) an unsupervised learning analysis. The following resources are good places to start looking for data:

Even if you end up working with a partner on the project (which isn’t required; working alone is fine), please complete this initial work individually. It’s fine if you and a potential partner collaborate on finding data and end up using the same dataset, but complete the short piece of writing (below) individually.

Check in with the instructor early if you need help.

Deliverables:

Write 1-2 paragraphs (no more than 350 words) summarizing:

The information in the dataset(s) and the context behind the data. Use the prompts below to guide your thoughts. (Note: in some situations, there may be incomplete information on the data context. That’s fine. Just do your best to summarize what information is available, and acknowledge the lack of information where relevant.)
- What are the cases?
- Broadly describe the variables contained in the data.
- Who collected the data? When, why, and how?
Three research questions:
- 1 that can be investigated in a regression setting
- 1 that can be investigated in a classification setting
- 1 that can be investigated in an unsupervised learning setting

Also make sure that you can read the data into R. You don’t need to do any analysis in R yet, but making sure that you can read the data will make the next steps go more smoothly.
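
As a quick check that the data can be read into R, a minimal sketch along these lines may help. It assumes the data is a CSV file; the file contents, `tmp_path`, and `my_data` are hypothetical placeholders, and the temporary file exists only to keep the sketch self-contained. Replace `tmp_path` with the path to your own file. Other formats (Excel, JSON, etc.) need different functions.

```r
# Hypothetical stand-in for your dataset: write a tiny CSV to a temporary file
# so that this sketch runs on its own. In practice, skip these two lines and
# set tmp_path to the location of your real data file.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("id,price,type", "1,10.5,a", "2,8.0,b", "3,NA,a"), tmp_path)

# Read the CSV into a data frame using base R
my_data <- read.csv(tmp_path)

# Basic checks that the read worked as expected
dim(my_data)    # number of rows (cases) and columns (variables)
str(my_data)    # variable names and types
head(my_data)   # first few rows
```

If the dimensions and variable types look reasonable, the data has been read in successfully. If you use the tidyverse, `readr::read_csv()` is a common alternative to `read.csv()`.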

Ethics in ML

Read the article “Amazon scraps secret AI recruiting tool that showed bias against women.” Write a short (roughly 250 words), thoughtful response about the themes and cautions that the article raises.

Portfolio Work

Setup: In addition to your submission here, you’ll want to collect your Portfolio work in a single document. James shared a Google Doc link with you on January 24th where you should keep your Portfolio responses. For each Homework submission, copy your Portfolio responses into the appropriate space. For example, Homework 1 has three prompts: Overfitting, Evaluating Regression Models, and Cross-validation. In your portfolio, paste your response to each prompt under the matching header.

Page maximum: 2 pages of text (pictures don’t count)

Organization: Your choice! Use titles and section headings that make sense to you. (It probably makes sense to have a separate section for each method.)

Deliverables: Put your responses for this part in a Google Doc, and update the link-sharing settings so that anyone at Macalester College with the link can edit. Include the URL for the Google Doc in your submission.

Note: Some prompts below may seem very open-ended. This is intentional. Crafting good responses requires looking back through our material to organize the concepts in a coherent, thematic way, which is extremely useful for your learning.

Concepts to address:

Overfitting: The video used the analogy of a cat picture model to explain overfitting. Come up with your own analogy to explain overfitting.
Evaluating regression models: Describe how residuals are central to the evaluation of regression models. Explain how they arise in quantitative evaluation metrics and how they are used in evaluation plots. Include examples of plots that show desirable and undesirable model behavior (feel free to draw them by hand if you wish) and what steps can be taken to address that undesirable behavior.
Cross-validation: In your own words, explain the rationale for cross-validation in relation to overfitting and model evaluation. Describe the algorithm in your own words in at most 2 sentences.