Data Wrangling

Six Main Verbs

James Normington

Due this Week

  • Assignment 4 due tomorrow night
  • At least 1 Tidy Tuesday (TT) by next Friday (2/24!)
    • If you haven’t turned one in, that’s either TT4 or TT5!

Data Wrangling

Getting the data in the tidy format that we want…

  • to visualize (glyph-ready: one row per glyph)
  • to summarize
  • to learn more about the data

Learning Goals

  • Understand and be able to use the following verbs appropriate: select, mutate, filter, arrange, summarize, group_by
  • Develop working knowledge of working with dates and lubridate functions

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
  • mutate

Verbs that change the cases (rows) but not the variables (columns)

  • filter
  • arrange

Grouped summaries

  • summarize

  • group_by

Six Main Verbs

## 
library(fivethirtyeight)
library(dplyr)
head(bechdel)
# A tibble: 6 × 15
   year imdb     title test  clean…¹ binary budget domgr…² intgr…³ code  budge…⁴
  <int> <chr>    <chr> <chr> <ord>   <chr>   <int>   <dbl>   <dbl> <chr>   <int>
1  2013 tt17114… 21 &… nota… notalk  FAIL   1.3 e7  2.57e7  4.22e7 2013…  1.3 e7
2  2012 tt13437… Dred… ok-d… ok      PASS   4.5 e7  1.34e7  4.09e7 2012…  4.57e7
3  2013 tt20245… 12 Y… nota… notalk  FAIL   2   e7  5.31e7  1.59e8 2013…  2   e7
4  2013 tt12728… 2 Gu… nota… notalk  FAIL   6.1 e7  7.56e7  1.32e8 2013…  6.1 e7
5  2013 tt04535… 42    men   men     FAIL   4   e7  9.50e7  9.50e7 2013…  4   e7
6  2013 tt13359… 47 R… men   men     FAIL   2.25e8  3.84e7  1.46e8 2013…  2.25e8
# … with 4 more variables: domgross_2013 <dbl>, intgross_2013 <dbl>,
#   period_code <int>, decade_code <int>, and abbreviated variable names
#   ¹​clean_test, ²​domgross, ³​intgross, ⁴​budget_2013

Verbs that change the variables (columns)

  • select
    • Subsets by column
bechdel_sub = bechdel %>% 
  select(title, clean_test, budget)

bechdel_sub
# A tibble: 1,794 × 3
   title                  clean_test    budget
   <chr>                  <ord>          <int>
 1 21 & Over              notalk      13000000
 2 Dredd 3D               ok          45000000
 3 12 Years a Slave       notalk      20000000
 4 2 Guns                 notalk      61000000
 5 42                     men         40000000
 6 47 Ronin               men        225000000
 7 A Good Day to Die Hard notalk      92000000
 8 About Time             ok          12000000
 9 Admission              ok          13000000
10 After Earth            notalk     130000000
# … with 1,784 more rows

mutate

  • Creates new variables
bechdel_sub %>%
  mutate(new_var_hi = "Hello!", 
         new_var_pass = (clean_test == "ok"), 
         budget_morethan30mil = (budget > 30000000)
  )
# A tibble: 1,794 × 6
   title                  clean_test    budget new_var_hi new_var_pass budget_…¹
   <chr>                  <ord>          <int> <chr>      <lgl>        <lgl>    
 1 21 & Over              notalk      13000000 Hello!     FALSE        FALSE    
 2 Dredd 3D               ok          45000000 Hello!     TRUE         TRUE     
 3 12 Years a Slave       notalk      20000000 Hello!     FALSE        FALSE    
 4 2 Guns                 notalk      61000000 Hello!     FALSE        TRUE     
 5 42                     men         40000000 Hello!     FALSE        TRUE     
 6 47 Ronin               men        225000000 Hello!     FALSE        TRUE     
 7 A Good Day to Die Hard notalk      92000000 Hello!     FALSE        TRUE     
 8 About Time             ok          12000000 Hello!     TRUE         FALSE    
 9 Admission              ok          13000000 Hello!     TRUE         FALSE    
10 After Earth            notalk     130000000 Hello!     FALSE        TRUE     
# … with 1,784 more rows, and abbreviated variable name ¹​budget_morethan30mil

Verbs that change the variables (rows)

  • filter
bechdel_sub %>%
  filter(clean_test == "ok")
# A tibble: 803 × 3
   title                clean_test    budget
   <chr>                <ord>          <int>
 1 Dredd 3D             ok          45000000
 2 About Time           ok          12000000
 3 Admission            ok          13000000
 4 American Hustle      ok          40000000
 5 August: Osage County ok          25000000
 6 Beautiful Creatures  ok          50000000
 7 Blue Jasmine         ok          18000000
 8 Carrie               ok          30000000
 9 Despicable Me 2      ok          76000000
10 Elysium              ok         120000000
# … with 793 more rows
  • arrange
bechdel_sub %>%
  arrange(title)
# A tibble: 1,794 × 3
   title                      clean_test   budget
   <chr>                      <ord>         <int>
 1 (500) Days of Summer       notalk      7500000
 2 [Rec]                      ok          2100000
 3 10 Things I Hate About You ok         13000000
 4 12 Years a Slave           notalk     20000000
 5 127 Hours                  dubious    18000000
 6 13 Going on 30             ok         30000000
 7 1408                       ok         22500000
 8 17 Again                   ok         40000000
 9 1776                       notalk      4000000
10 2 Fast 2 Furious           notalk     76000000
# … with 1,784 more rows
bechdel_sub %>%
  arrange(desc(title))
# A tibble: 1,794 × 3
   title                      clean_test   budget
   <chr>                      <ord>         <int>
 1 Zwartboek                  ok         22000000
 2 Zoom                       ok         35000000
 3 Zoolander                  ok         28000000
 4 Zombieland                 ok         23600000
 5 Zero Dark Thirty           ok         52500000
 6 Zathura: A Space Adventure nowomen    65000000
 7 Youth in Revolt            notalk     18000000
 8 Yours, Mine and Ours       ok         45000000
 9 Your Sister's Sister       ok           120000
10 Young Guns                 notalk     13000000
# … with 1,784 more rows

Verbs that change the variables (rows)

  • summarize
    • collapses rows and calculates a summary
bechdel_sub %>%
  summarize(mean(budget),
            median(budget))
# A tibble: 1 × 2
  `mean(budget)` `median(budget)`
           <dbl>            <dbl>
1      44826463.         28000000
  • group_by
    • Defines groups within data
bechdel_sub %>%
  group_by(clean_test) %>%
  summarize(mean(budget),
            median(budget))
# A tibble: 5 × 3
  clean_test `mean(budget)` `median(budget)`
  <ord>               <dbl>            <dbl>
1 nowomen         48606738.         28000000
2 notalk          53225573.         39000000
3 men             45127098.         27750000
4 dubious         49263380.         28750000
5 ok              37929168.         25000000

Visual Illustration

https://tidydatatutor.com/vis.html#

Template File

Download a template .Rmd of this activity. Put the file in a Day_07 folder within your COMP_STAT_112 folder.

  • This .Rmd only contains exercises that we’ll work on in class and you’ll finish for Assignment 6.

Data Example

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.

library(mosaicData)
Birthdays <- Birthdays %>% select(state, date, year, births)

First Exercise

Consider the Birthdays data

  1. Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
BirthdaysExtra <- mutate(Birthdays, 
      year_short = year - 1900,
      busy_birthday = (births > 100)
)
  1. Then form a new table that only has three columns: the state and your two new columns.
BirthdaysExtraTable <- select(BirthdaysExtra, 
                         state,
                         year_short,
                         busy_birthday
                         )
  1. What does the following operation return: select(Birthdays, ends_with("te"))?
select(Birthdays, ends_with("te")) %>% head()
  state       date
1    AK 1969-01-01
2    AL 1969-01-01
3    AR 1969-01-01
4    AZ 1969-01-01
5    CA 1969-01-01
6    CO 1969-01-01

Second Exercise

Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.

MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))

head(MABirths1979Sorted)
  state       date year births
1    MA 1979-09-28 1979    262
2    MA 1979-09-11 1979    252
3    MA 1979-12-28 1979    249
4    MA 1979-09-26 1979    246
5    MA 1979-07-24 1979    245
6    MA 1979-04-27 1979    243

Third Exercise

Consider the Birthdays data again.

  1. Find the average number of daily births (per state) in each year.
  2. Find the average number of daily births in each year, by state.
BirthdaysYear <- group_by(Birthdays, year)
summarise(BirthdaysYear, average = mean(births))
# A tibble: 20 × 2
    year average
   <int>   <dbl>
 1  1969    192.
 2  1970    200.
 3  1971    191.
 4  1972    175.
 5  1973    169.
 6  1974    170.
 7  1975    169.
 8  1976    170.
 9  1977    179.
10  1978    179.
11  1979    188.
12  1980    194.
13  1981    195.
14  1982    198.
15  1983    196.
16  1984    197.
17  1985    202.
18  1986    202.
19  1987    205.
20  1988    210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarise(BirthdaysYearState, average = mean(births))
# A tibble: 1,020 × 3
# Groups:   year [20]
    year state average
   <int> <chr>   <dbl>
 1  1969 AK       18.6
 2  1969 AL      174. 
 3  1969 AR       91.3
 4  1969 AZ       93.3
 5  1969 CA      954. 
 6  1969 CO      110. 
 7  1969 CT      134. 
 8  1969 DC       75.3
 9  1969 DE       27.6
10  1969 FL      292. 
# … with 1,010 more rows

Piping

QuickMABirths1979 <-
  Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

With the pipe notation, x %>% f(y) reads as apply function f to the data frame x, and y are additional arguments. Above, x is Birthdays, f is filter(), and y is state == "MA" and year == 1979 .

Dates

Birthdays <-
  Birthdays %>%
  mutate(
    month = month(date, label = TRUE),
    weekday = wday(date, label = TRUE)
  )

Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.

  Birthdays %>%
  filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
  group_by(state) %>%
  summarise(total = sum(births)) %>%
  arrange(desc(total)) %>%
  head(n = 5)
# A tibble: 5 × 2
  state total
  <chr> <int>
1 CA     3246
2 TX     2347
3 NY     1943
4 IL     1673
5 OH     1408

Rest of Class

Continue working on the activity; check in with your classmates.

Don’t leave anyone left struggling alone!

After Class

This activity is all code, no interpretations.

There are many exercises to give you plenty of practice with these important six tasks!

You’ll finish the activity as part of Assignment 5.