sid | sessionID | grade |
---|---|---|
S31842 | session2207 | B+ |
S32436 | session3172 | S |
S31671 | session3435 | A- |
S31929 | session3512 | NC |
left_join
, inner_join
and full_join
in the dplyr
packagesemi_join
, anti_join
in the dplyr
packageDownload a template .Rmd of this activity. Put the file in a Day_09
folder within your COMP_STAT_112
folder.
A join is a verb that means to combine two data tables.
All joins involve establishing a correspondence — a match — between each case in the left table and zero or more cases in the right table.
The various joins differ in how they handle multiple matches or missing matches.
A match between a case in the left data table and a case in the right data table is made based on the values in keys, variables that uniquely define observations in a data table.
A primary key uniquely identifies an observation in its own table.
A foreign key uniquely identifies an observation in another table.
In order to establish a match between two data tables,
sid | sessionID | grade |
---|---|---|
S31842 | session2207 | B+ |
S32436 | session3172 | S |
S31671 | session3435 | A- |
S31929 | session3512 | NC |
sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|
session2780 | O | 300 | SP2003 | 21 | inst298 |
session3520 | k | 300 | FA2004 | 16 | inst463 |
session1965 | d | 100 | FA2001 | 25 | inst414 |
session3257 | o | 200 | SP2004 | 16 | inst312 |
What variables are the primary keys (uniquely identifies a case) for Grades
?
sid
(student ID) andsessionID
(class ID) are the primary keys forGrades
as they unique identify each case.
What variables are the primary keys (uniquely identifies a case) for Courses
?
sessionID
(class ID) anddept
are the primary keys forCourses
as they unique identify each case. You may have thought thatsessionID
alone was sufficient; however, if a course is cross-listed, then it may have multiple departments listed.
What variables are the foreign keys in Grades
for Courses
?
sessionID
(class ID) is part of a foreign key inGrades
forCourses
. If wegroup_by
andsummarize
first to deal with cross-listed courses, thensessionID
is the foreign key inGrades
forCourses
.
What variables are the foreign keys in Courses
for Grades
?
sessionID
inCourses
is only part of a foreign key that uniquely identify rows inGrades
.
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
left_join()
: the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left.
inner_join()
: the output has only the cases from the left with a match in the right.
full_join()
: the output has all cases from the left and the right. This is less common than the first two join operators.
Determine the average class size from the viewpoint of the Provost / Admissions Office
CourseSizes <- Courses %>%
group_by(sessionID) %>% # needed for cross-listed courses
summarise(total_enroll = sum(enroll))
head(CourseSizes)
# A tibble: 6 × 2
sessionID total_enroll
<chr> <dbl>
1 session1784 22
2 session1785 52
3 session1791 22
4 session1792 20
5 session1794 22
6 session1795 26
[1] 21.45251
Determine the average class size from the viewpoint of a student (average class size for each student experiences across their courses)
EnrollmentsWithClassSize <- Grades %>%
left_join(CourseSizes,
by = c("sessionID" = "sessionID")
) %>%
select(sid, sessionID, total_enroll)
head(EnrollmentsWithClassSize)
# A tibble: 6 × 3
sid sessionID total_enroll
<chr> <chr> <dbl>
1 S31185 session1784 22
2 S31185 session1785 52
3 S31185 session1791 22
4 S31185 session1792 20
5 S31185 session1794 22
6 S31185 session1795 26
AveClassEachStudent <- EnrollmentsWithClassSize %>%
group_by(sid) %>%
summarise(ave_enroll = mean(total_enroll, na.rm = TRUE))
head(AveClassEachStudent)
# A tibble: 6 × 2
sid ave_enroll
<chr> <dbl>
1 S31185 29
2 S31188 27.6
3 S31191 29.1
4 S31194 19.5
5 S31197 26.3
6 S31200 26.5
[1] 24.41885
Filtering joins affect the observations, not the variables.
semi_join()
: discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.
anti_join()
: discards any cases in the left table that have a match in the right table.
Find a subset of the Grades
data that only contains data on the four largest sections in the Courses
data set.
LargeSections <- Courses %>%
group_by(sessionID) %>%
summarise(total_enroll = sum(enroll)) %>%
arrange(desc(total_enroll)) %>% head(4)
LargeSections
# A tibble: 4 × 2
sessionID total_enroll
<chr> <dbl>
1 session2956 120
2 session2171 105
3 session2465 93
4 session2170 85
# A tibble: 6 × 3
sid sessionID grade
<chr> <chr> <chr>
1 S31188 session2956 S
2 S31344 session2465 S
3 S31365 session2956 S
4 S31389 session2170 A-
5 S31389 session2465 S
6 S31482 session2170 B+
Use semi_join()
to create a table with a subset of the rows of Grades
corresponding to all classes taken in department J
.
Work on Examples 9.1-9.3, then Exercises 9.1-9.6. to turn in for Assignment 6 (Part 1).
(optional) Tidy Tuesday 5
Submit Iterative Viz 0