# A tibble: 1,794 × 3
title clean_test budget
<chr> <ord> <int>
1 21 & Over notalk 13000000
2 Dredd 3D ok 45000000
3 12 Years a Slave notalk 20000000
4 2 Guns notalk 61000000
5 42 men 40000000
6 47 Ronin men 225000000
7 A Good Day to Die Hard notalk 92000000
8 About Time ok 12000000
9 Admission ok 13000000
10 After Earth notalk 130000000
# … with 1,784 more rows
# A tibble: 1,794 × 6
title clean_test budget new_var_hi new_var_pass budget_…¹
<chr> <ord> <int> <chr> <lgl> <lgl>
1 21 & Over notalk 13000000 Hello! FALSE FALSE
2 Dredd 3D ok 45000000 Hello! TRUE TRUE
3 12 Years a Slave notalk 20000000 Hello! FALSE FALSE
4 2 Guns notalk 61000000 Hello! FALSE TRUE
5 42 men 40000000 Hello! FALSE TRUE
6 47 Ronin men 225000000 Hello! FALSE TRUE
7 A Good Day to Die Hard notalk 92000000 Hello! FALSE TRUE
8 About Time ok 12000000 Hello! TRUE FALSE
9 Admission ok 13000000 Hello! TRUE FALSE
10 After Earth notalk 130000000 Hello! FALSE TRUE
# … with 1,784 more rows, and abbreviated variable name ¹budget_morethan30mil
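The call that produced the six-column table above is not shown. The sketch below is one way such columns could be built with mutate(); the exact expressions are assumptions inferred from the printed values, the three-row tibble is a stand-in copied from the output above, the factor level order is assumed, and bechdel_new is a hypothetical name.

```r
library(dplyr)

# Stand-in for bechdel_sub: three rows copied from the printed output.
# The level order of clean_test is an assumption.
bechdel_sub <- tibble(
  title = c("21 & Over", "Dredd 3D", "2 Guns"),
  clean_test = factor(c("notalk", "ok", "notalk"),
                      levels = c("nowomen", "notalk", "men", "dubious", "ok"),
                      ordered = TRUE),
  budget = c(13000000L, 45000000L, 61000000L)
)

# mutate() keeps the existing columns and appends the new ones.
bechdel_new <- bechdel_sub %>%
  mutate(
    new_var_hi = "Hello!",                      # same constant in every row
    new_var_pass = (clean_test == "ok"),        # TRUE when the film passes the test
    budget_morethan30mil = (budget > 30000000)  # assumed threshold, from the column name
  )
bechdel_new
```

A single constant such as "Hello!" is recycled to every row, while the two logical columns are computed row by row from existing variables.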
Verbs that change the cases (rows)
filter
bechdel_sub %>% filter(clean_test == "ok")
# A tibble: 803 × 3
title clean_test budget
<chr> <ord> <int>
1 Dredd 3D ok 45000000
2 About Time ok 12000000
3 Admission ok 13000000
4 American Hustle ok 40000000
5 August: Osage County ok 25000000
6 Beautiful Creatures ok 50000000
7 Blue Jasmine ok 18000000
8 Carrie ok 30000000
9 Despicable Me 2 ok 76000000
10 Elysium ok 120000000
# … with 793 more rows
arrange
bechdel_sub %>% arrange(title)
# A tibble: 1,794 × 3
title clean_test budget
<chr> <ord> <int>
1 (500) Days of Summer notalk 7500000
2 [Rec] ok 2100000
3 10 Things I Hate About You ok 13000000
4 12 Years a Slave notalk 20000000
5 127 Hours dubious 18000000
6 13 Going on 30 ok 30000000
7 1408 ok 22500000
8 17 Again ok 40000000
9 1776 notalk 4000000
10 2 Fast 2 Furious notalk 76000000
# … with 1,784 more rows
bechdel_sub %>% arrange(desc(title))
# A tibble: 1,794 × 3
title clean_test budget
<chr> <ord> <int>
1 Zwartboek ok 22000000
2 Zoom ok 35000000
3 Zoolander ok 28000000
4 Zombieland ok 23600000
5 Zero Dark Thirty ok 52500000
6 Zathura: A Space Adventure nowomen 65000000
7 Youth in Revolt notalk 18000000
8 Yours, Mine and Ours ok 45000000
9 Your Sister's Sister ok 120000
10 Young Guns notalk 13000000
# … with 1,784 more rows
First Exercise
Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
BirthdaysExtra <- mutate(Birthdays, year_short = year - 1900, busy_birthday = (births > 100))
Then form a new table that only has three columns: the state and your two new columns.
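One possible way to form that three-column table is sketched below with select(). The small Birthdays data frame is a made-up stand-in that only mimics the column names, and BirthdaysExtraTable is a hypothetical name chosen for illustration.

```r
library(dplyr)

# Made-up stand-in for Birthdays (illustrative values only).
Birthdays <- data.frame(
  state = c("AK", "CA", "MA"),
  year = c(1969, 1969, 1979),
  births = c(14, 954, 262)
)

# Add the two new variables, as in the exercise.
BirthdaysExtra <- mutate(Birthdays,
                         year_short = year - 1900,
                         busy_birthday = (births > 100))

# Keep only the state and the two new columns.
BirthdaysExtraTable <- select(BirthdaysExtra, state, year_short, busy_birthday)
BirthdaysExtraTable
```

select() returns the columns in the order they are listed, so the same call can also be used to reorder variables.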
What does the following operation return: select(Birthdays, ends_with("te"))?
select(Birthdays, ends_with("te")) %>% head()
state date
1 AK 1969-01-01
2 AL 1969-01-01
3 AR 1969-01-01
4 AZ 1969-01-01
5 CA 1969-01-01
6 CO 1969-01-01
Second Exercise
Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.
MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))
head(MABirths1979Sorted)
state date year births
1 MA 1979-09-28 1979 262
2 MA 1979-09-11 1979 252
3 MA 1979-12-28 1979 249
4 MA 1979-09-26 1979 246
5 MA 1979-07-24 1979 245
6 MA 1979-04-27 1979 243
Third Exercise
Consider the Birthdays data again.
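Find the average number of daily births per state in each year (one value per year, averaged over every state and day).
Find the average number of daily births in each year separately for each state (one value per year-state pair).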
BirthdaysYear <- group_by(Birthdays, year)
summarise(BirthdaysYear, average = mean(births))
BirthdaysYearState <- group_by(Birthdays, year, state)
summarise(BirthdaysYearState, average = mean(births))
# A tibble: 1,020 × 3
# Groups: year [20]
year state average
<int> <chr> <dbl>
1 1969 AK 18.6
2 1969 AL 174.
3 1969 AR 91.3
4 1969 AZ 93.3
5 1969 CA 954.
6 1969 CO 110.
7 1969 CT 134.
8 1969 DC 75.3
9 1969 DE 27.6
10 1969 FL 292.
# … with 1,010 more rows
Piping
QuickMABirths1979 <- Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))
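With the pipe notation, x %>% f(y) reads as "apply the function f to the data frame x, with y as additional arguments"; it is equivalent to f(x, y). Above, x is Birthdays, f is filter(), and y is the pair of conditions state == "MA" and year == 1979. The filtered result is then piped into arrange(desc(births)) in the same way.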
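To make the reading of the pipe concrete, the sketch below writes the same two steps once as nested function calls and once with %>%, then checks that both give the same table. The Birthdays data frame here is a made-up stand-in, and the names nested and piped are chosen for illustration.

```r
library(dplyr)

# Made-up stand-in for Birthdays (illustrative values only).
Birthdays <- data.frame(
  state = c("MA", "MA", "CA", "MA"),
  year = c(1979, 1979, 1979, 1980),
  births = c(200, 262, 900, 210)
)

# Nested form: read from the inside out.
nested <- arrange(filter(Birthdays, state == "MA", year == 1979), desc(births))

# Piped form: read from top to bottom.
piped <- Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

# Both forms perform the same operations in the same order.
identical(nested, piped)
```

The piped form avoids intermediate names such as MABirths1979 and keeps the steps in the order in which they are applied.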
Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.
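One possible approach to this exercise is sketched below. It assumes that "most births" means the total over the three days, and that the date column can be converted with as.Date(); the Birthdays data frame is a made-up stand-in with invented counts, and TopSeptStates and total are hypothetical names.

```r
library(dplyr)

# Made-up stand-in for Birthdays (invented counts, one row outside the window).
Birthdays <- data.frame(
  state = c("CA", "CA", "TX", "TX", "NY", "NY", "MA", "MA",
            "AK", "AK", "FL", "FL", "AK"),
  date = as.Date(c(rep(c("1979-09-09", "1979-09-11"), times = 6),
                   "1979-09-12")),
  births = c(900, 950, 700, 720, 600, 640, 200, 252, 10, 12, 300, 310, 5000)
)

TopSeptStates <- Birthdays %>%
  # Keep only the days in the window, endpoints included.
  filter(as.Date(date) >= as.Date("1979-09-09"),
         as.Date(date) <= as.Date("1979-09-11")) %>%
  # Total the births within each state.
  group_by(state) %>%
  summarise(total = sum(births)) %>%
  # Largest totals first, then keep the first five rows.
  arrange(desc(total)) %>%
  head(n = 5)
TopSeptStates
```

The row dated September 12 is dropped by filter() before any totals are computed, so a large count outside the window does not affect the ranking.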