Data Import

James Normington

Midterm Grades

Function of two assessments:
- Engagement
- Midterm score
Final grade will be comprised of these two, plus:
- Final Project
- Iterative Viz

Final Project

Brainstorming

Start thinking about Final Projects! I’ll finalize the template/rubric next week at the latest.

What interests you? What do you want to learn about? What are you an expert in?

Choose something you can stick with! Trust me!

Midterm Revisions – Part 1

For each problem marked wrong:

write a more correct answer and 1 sentence description of why that is a more correct answer (maybe why the original answer wasn’t correct).

Talk with others in the class; help each other understand the WHY.

Turn paper copy into me by next class (Thursday, March 23rd @ 8am).

Midterm Revisions Part 2

You should have been notified of a shared pdf with feedback on Moodle.
Talk through some of the stumbling blocks with your classmates. Take notes for yourself.
By Friday night at 11:59pm, submit an updated version of the Midterm Part 2 to Moodle

My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.

Note: If you got an “A” on the midterm, your Final Grade won’t be improved with revisions.

Learning Goals

Develop comfort in finding an existing data set to import into R
Develop comfort in importing data of a variety of file types into R
Understand and implement the data cleaning process to make values consistent
Understand the implications of different ways of dealing with missing values with replace_na and drop_na

Key Ideas

Finding Existing Data Sets
Loading Data Sets
Checking Imported Data
Cleaning Data

More thorough notes available at https://jamesnormington.github.io/112_spring_2023/data-import.html

Finding Existing Data Sets

Google: “movie reviews filetype:csv”
awesome-public-datasets GitHub repo
Kaggle data
data.world website
Prof Heggeseth’s website

Loading Data Sets

Depending on the file type (csv, tsv, excel, Google sheet, stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:

read_csv()
read_delim()
read_sheet()
st_read()

Loading Data Sets

The Import Wizard can help you write the code!

Try importing data from:

https://jamesnormington.github.io/112_spring_2023/data/imdb_5000_messy.csv

Note: When using the Import Wizard, make sure to copy and paste the code into a Rmd file.

Checking Imported Data

Always look at the data after importing with View()
Do a quick summary of all variables:

dataset_name %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  summary()

Cleaning Data

Cleaning Categorical Variables

“Clean” data has consistent values in terms of spelling and capitalization.

How could we clean this up?

imdbMessy <- read_csv("https://jamesnormington.github.io/112_spring_2023/data/imdb_5000_messy.csv")
imdbMessy %>% count(color)

# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 color              30
4 Color            4755
5 COLOR              30
6 <NA>               19

imdbMessy <- imdbMessy %>% 
  mutate(color = factor(color)) %>%  #convert to a factor
  mutate(color = fct_recode(color, "B&W" = "Black and White", "Color" = "color", "Color" = "COLOR")) #recode levels

imdbMessy %>% count(color)

# A tibble: 3 × 2
  color     n
  <fct> <int>
1 B&W     209
2 Color  4815
3 <NA>     19

Cleaning Data

Study the individual observations with NAs carefully.

Why do you think they are missing?
Are certain types of observations more likely to have NAs?

Addressing Missing Data

You have several options for dealing with NAs (and they have different consequences):

You can remove observations with one or more NAs (see drop_na).
You can remove columns with many NA values with select.
You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero), or the average for a column. (see replace_na)

Cleaning Data

Let’s check to see how many values are missing per variable.

colSums(is.na(imdbMessy)) # Number of values that are missing per variable

                     ...1                     color             director_name 
                        0                        19                       104 
   num_critic_for_reviews                  duration   director_facebook_likes 
                       50                        15                       104 
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes 
                       23                        13                         7 
                    gross                    genres              actor_1_name 
                      884                         0                         7 
              movie_title           num_voted_users cast_total_facebook_likes 
                        0                         0                         0 
             actor_3_name      facenumber_in_poster             plot_keywords 
                       23                        13                       153 
          movie_imdb_link      num_user_for_reviews                  language 
                        0                        21                        12 
                  country            content_rating                    budget 
                        5                       303                       492 
               title_year    actor_2_facebook_likes                imdb_score 
                      108                        13                         0 
             aspect_ratio      movie_facebook_likes 
                      329                         0

Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?

imdbMessy %>% filter(is.na(actor_1_facebook_likes)) %>% select(movie_title,actor_1_name,actor_1_facebook_likes) %>% head()

# A tibble: 6 × 3
  movie_title              actor_1_name actor_1_facebook_likes
  <chr>                    <chr>                         <dbl>
1 Pink Ribbons, Inc.       <NA>                             NA
2 Sex with Strangers       <NA>                             NA
3 The Harvest/La Cosecha   <NA>                             NA
4 Ayurveda: Art of Being   <NA>                             NA
5 The Brain That Sings     <NA>                             NA
6 The Blood of My Brother  <NA>                             NA

To remove observations (rows) that are missing actor_1_facebook_likes,

imdbMessysub <- imdbMessy %>% filter(!is.na(actor_1_facebook_likes))  #Notice how I saved this new smaller dataset to a new name

To replace missing values of actor_1_facebook_likes with 0,

imdbMessysub2 <- imdbMessy %>% mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

Assignment Part 1 (due Wed. 3/29 @ 11:59pm)

Find a dataset that is not built into R and is related to one of the following topics:

a hobby or passion
hometown or place you have lived
Something you know nothing about, but want to learn about

Load the data into R, make sure it is clean, and construct one interesting visualization of the data.

Note: this might help you brainstorm ideas for projects

After Class

Midterm Revisions Part 1 due Thursday 3/23 @ 8am
Midterm Revisions Part 2 due Friday 3/24 @ 11:59pm
IV1 due Friday 3/24 @ 11:59pm
Assignment 8 (Parts 1 and 2) due Wed., 3/29 @ 11:59pm
Start thinking about your Final Project! (more next week)