Function of two assessments:
Engagement
Midterm score
The final grade will be composed of these two, plus:
Final Project
Iterative Viz
Start thinking about Final Projects! I’ll finalize the template/rubric next week at the latest.
What interests you? What do you want to learn about? What are you an expert in?
Choose something you can stick with! Trust me!
For each problem marked wrong:
Talk with others in the class; help each other understand the WHY.
Turn the paper copy in to me by next class (Thursday, March 23rd @ 8am).
You should have been notified of a shared pdf with feedback on Moodle.
Talk through some of the stumbling blocks with your classmates. Take notes for yourself.
By Friday night at 11:59pm, submit an updated version of Midterm Part 2 to Moodle.
My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.
Note: If you got an “A” on the midterm, your Final Grade won’t be improved with revisions.
replace_na and drop_na
More thorough notes available at https://jamesnormington.github.io/112_spring_2023/data-import.html
Depending on the file type (csv, tsv, excel, Google sheet, stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
read_csv()
read_delim()
read_sheet()
st_read()
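As a rough illustration of matching the function to the file type, the sketch below reads the same tiny table once as comma-separated text and once as tab-separated text. The inline data and the object names (csv_text, tsv_text, movies_csv, movies_tsv) are made up for this example; in practice you would pass a file path or URL instead of I(...).

```r
library(readr)

# Hypothetical inline data standing in for files on disk.
csv_text <- "title,year\nAlpha,2001\nBeta,2005\n"
tsv_text <- "title\tyear\nAlpha\t2001\nBeta\t2005\n"

# read_csv() assumes commas between fields.
movies_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# read_delim() lets you name the delimiter yourself, here a tab.
movies_tsv <- read_delim(I(tsv_text), delim = "\t", show_col_types = FALSE)

# Both calls should produce the same two-row, two-column table.
movies_csv
movies_tsv
```

read_sheet() comes from the googlesheets4 package and st_read() from the sf package, so those need their own library() calls; they are not loaded here.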
The Import Wizard can help you write the code!
Try importing data from:
https://jamesnormington.github.io/112_spring_2023/data/imdb_5000_messy.csv
Note: When using the Import Wizard, make sure to copy and paste the code into an Rmd file.
After importing, always look at the data with View().
Do a quick summary of all variables:
dataset_name %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
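To see why the character columns are converted first, here is the same pipeline on a small made-up tibble (toy_movies and its values are hypothetical). On a character column, summary() only reports length and type; on a factor column it reports a count per level, including NAs.

```r
library(dplyr)

# Hypothetical toy data with one character and one numeric column.
toy_movies <- tibble(
  color = c("Color", "Color", "Black and White", NA),
  duration = c(120, 95, NA, 101)
)

# Convert every character column to a factor, then summarize all columns.
toy_summary <- toy_movies %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()

toy_summary
```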
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
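One possible approach, sketched on invented values: lowercase the text so capitalization no longer matters, then map each known spelling to a single label. The tibble toy_colors and its entries are hypothetical and only illustrate the kind of inconsistency described here.

```r
library(dplyr)
library(stringr)

# Hypothetical messy values: same categories, inconsistent spelling and case.
toy_colors <- tibble(
  color = c("Color", "color", "COLOR", "B&W", "Black and White", NA)
)

cleaned <- toy_colors %>%
  mutate(
    # Lowercasing removes capitalization differences.
    color_lower = str_to_lower(color),
    # Map each known spelling to one label; leave anything unmatched as-is.
    color_clean = case_when(
      color_lower == "color" ~ "Color",
      color_lower %in% c("b&w", "black and white") ~ "Black and White",
      TRUE ~ color
    )
  )

# Check the result: each category should now appear under one spelling.
cleaned %>% count(color_clean)
```

Counting the cleaned column afterward is a quick way to confirm that no stray spellings remain.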
Study the individual observations with NAs carefully.
Addressing Missing Data
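You have several options for dealing with NAs (and they have different consequences):
Remove observations with missing values (see drop_na).
Remove columns with many missing values (see select).
Replace NAs with a reasonable value (see replace_na).
Let’s check to see how many values are missing per variable.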
                     ...1                     color             director_name
                        0                        19                       104
   num_critic_for_reviews                  duration   director_facebook_likes
                       50                        15                       104
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes
                       23                        13                         7
                    gross                    genres              actor_1_name
                      884                         0                         7
              movie_title           num_voted_users cast_total_facebook_likes
                        0                         0                         0
             actor_3_name      facenumber_in_poster             plot_keywords
                       23                        13                       153
          movie_imdb_link      num_user_for_reviews                  language
                        0                        21                        12
                  country            content_rating                    budget
                        5                       303                       492
               title_year    actor_2_facebook_likes                imdb_score
                      108                        13                         0
             aspect_ratio      movie_facebook_likes
                      329                         0
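Per-variable counts like those above can be computed in base R by summing a logical matrix column by column. A minimal sketch on a made-up tibble (toy_movies is hypothetical; you would substitute your own imported data):

```r
library(tibble)

# Hypothetical toy data with a few missing values.
toy_movies <- tibble(
  movie_title = c("A", "B", "C", "D"),
  gross = c(NA, 1.5e6, NA, 2.0e6),
  budget = c(1e6, NA, 3e6, 4e6)
)

# is.na() marks each cell TRUE/FALSE; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy_movies))
na_counts
```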
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title             actor_1_name actor_1_facebook_likes
  <chr>                   <chr>                         <dbl>
1 Pink Ribbons, Inc.      <NA>                             NA
2 Sex with Strangers      <NA>                             NA
3 The Harvest/La Cosecha  <NA>                             NA
4 Ayurveda: Art of Being  <NA>                             NA
5 The Brain That Sings    <NA>                             NA
6 The Blood of My Brother <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na.
To replace missing values of actor_1_facebook_likes with 0, use replace_na.
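Both operations can be sketched on a small made-up tibble. Here toy_movies, dropped, and filled are hypothetical names; only the column name actor_1_facebook_likes is taken from the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one movie is missing its likes count.
toy_movies <- tibble(
  movie_title = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Option 1: drop the rows where this one column is NA.
dropped <- toy_movies %>%
  drop_na(actor_1_facebook_likes)

# Option 2: keep every row, but fill the NAs with 0.
filled <- toy_movies %>%
  mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

dropped
filled
```

Dropping shrinks the dataset, while filling with 0 treats "unknown" as "no likes" and pulls averages downward. Which one is appropriate depends on why the value is missing.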
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Midterm Revisions Part 1 due Thursday 3/23 @ 8am
Midterm Revisions Part 2 due Friday 3/24 @ 11:59pm
IV1 due Friday 3/24 @ 11:59pm
Assignment 8 (Parts 1 and 2) due Wed., 3/29 @ 11:59pm
Start thinking about your Final Project! (more next week)