Topic 14 Regular Expressions
Learning Goals
- Develop comfort in working with strings of text data
- Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.
Create a new Rmd file (save it as 14-Regex.Rmd). Put this file in a folder Assignment_09 in your COMP_STAT_112 folder.
Regular Expressions and Character Strings
Regular expressions allow us to describe character patterns. With them, we can:
- Search for particular items within a large body of text. For example, you may wish to identify and extract all email addresses.
- Replace particular items. For example, you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents.
- Validate input. For example, you may want to check that a password meets certain criteria, such as a mix of uppercase and lowercase letters, digits, and punctuation.
- Coordinate actions. For example, you may wish to process certain files in a directory, but only if they meet particular conditions.
- Reformat text. For example, you may want to split strings into different parts, each to form new variables.
- and more…
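To make the first two uses in the list above concrete, here is a minimal sketch using stringr. The text in notes and the pattern email_pat are made up for illustration, and the email pattern is deliberately simple; it is not a complete validator.

```r
library(stringr)

# Hypothetical text, invented for this illustration
notes <- "Contact ana@example.com or bo.li@example.org; <B>bold</B> text."

# Simplified email pattern: name characters, "@", domain, ".", letters
email_pat <- "[A-Za-z0-9._]+@[A-Za-z0-9.]+\\.[A-Za-z]+"

# Search: pull out every substring that looks like an email address
emails <- str_extract_all(notes, email_pat)[[1]]
emails

# Replace: lowercase every uppercase HTML tag by passing a function
# that is applied to each match
lowered <- str_replace_all(notes, "</?[A-Z]+>", tolower)
lowered
```

Passing a function as the replacement is one option among several; a fixed replacement string is enough when every match should become the same text.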
Start by doing this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R. Some of the syntax in the tutorial is slightly different from what we’ll use in R, but it will still help you get acclimated to the main ideas of regular expressions.
Wrangling with Regular Expressions in R
Now that we have some idea how regular expressions work, let’s examine how to use them to achieve various tasks in R. It will be helpful to have your cheat sheet handy. Many of these tasks can be accomplished either with functions from the base (built-in) package in R or with functions from the stringr package, which is part of the Tidyverse. In general, the stringr functions are faster, which will be noticeable when processing a large amount of text data.
example <- "The quick brown fox jumps over the lazy dog."
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
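As a rough side-by-side of the base and stringr approaches mentioned above, the sketch below runs the same three tasks both ways. The variable sentence is defined here only so the block is self-contained. Note the argument order: the base functions take the pattern first, while the stringr functions take the string first.

```r
library(stringr)

sentence <- "The quick brown fox jumps over the lazy dog."

# Replace the first match: base sub() vs. stringr str_replace()
base_first <- sub("o", "0", sentence)
stringr_first <- str_replace(sentence, "o", "0")

# Replace every match: base gsub() vs. stringr str_replace_all()
base_all <- gsub("o", "0", sentence)
stringr_all <- str_replace_all(sentence, "o", "0")

# Detect a pattern: base grepl() vs. stringr str_detect()
base_found <- grepl("fox", sentence)
stringr_found <- str_detect(sentence, "fox")

# Each pair should agree
identical(base_first, stringr_first)
identical(base_all, stringr_all)
identical(base_found, stringr_found)
```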
Search and replace patterns with str_replace or str_replace_all (stringr)
To search for a pattern and replace it, we can use the functions str_replace and str_replace_all in the stringr package. Note that str_replace only replaces the first matched pattern, while str_replace_all replaces all. Here are some examples:
str_replace(example, "quick", "really quick")
## [1] "The really quick brown fox jumps over the lazy dog."
str_replace_all(example, "(fox|dog)", "****") # | reads as OR
## [1] "The quick brown **** jumps over the lazy ****."
str_replace_all(example, "(fox|dog).", "****") # "." for any character
## [1] "The quick brown ****jumps over the lazy ****"
str_replace_all(example, "(fox|dog)\\.$", "****") # at end of sentence only, "\\." only for a period
## [1] "The quick brown fox jumps over the lazy ****"
str_replace_all(example, "the", "a") # case-sensitive only matches one
## [1] "The quick brown fox jumps over a lazy dog."
str_replace_all(example, "[Tt]he", "a") # will match either t or T; could also make "a" conditional on capitalization of t
## [1] "a quick brown fox jumps over a lazy dog."
str_replace(example, "[Tt]he", "a") # first match only
## [1] "a quick brown fox jumps over the lazy dog."
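One possible way to make the replacement depend on capitalization, as the comment above suggests, is to pass str_replace_all a named vector of pattern = replacement pairs. The sketch below also shows a backreference and a case-insensitive match. The variable sentence is redefined here so the block runs on its own.

```r
library(stringr)

sentence <- "The quick brown fox jumps over the lazy dog."

# Named vector: each name is a pattern, each value is its replacement
by_case <- str_replace_all(sentence, c("The" = "A", "the" = "a"))
by_case
## [1] "A quick brown fox jumps over a lazy dog."

# Backreference: \\1 re-inserts whatever the first group matched
bracketed <- str_replace_all(sentence, "(fox|dog)", "[\\1]")
bracketed
## [1] "The quick brown [fox] jumps over the lazy [dog]."

# regex(..., ignore_case = TRUE) is an alternative to writing [Tt]
no_case <- str_replace_all(sentence, regex("the", ignore_case = TRUE), "a")
no_case
## [1] "a quick brown fox jumps over a lazy dog."
```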
Detect patterns with str_detect (stringr)
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
examples <- c(example, example2, example3)
pat <- "[^aeiouAEIOU ]{3}" # Regular expression for three straight consonants. Note that I've excluded spaces as well
str_detect(examples, pat) # TRUE/FALSE if it detects pattern
## [1] TRUE TRUE FALSE
str_subset(examples, pat) # Returns only the strings that contain the pattern
## [1] "The quick brown fox jumps over the lazy dog."
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
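Because str_detect returns one TRUE/FALSE per string, it combines naturally with dplyr::filter when the strings live in a data table. The sketch below is illustrative only: the animals tibble is made up, and the pattern is the same three-consonant pattern as above (which treats y as a consonant).

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical data, invented for this illustration
animals <- tibble(name = c("lynx", "otter", "sphynx cat", "emu"))
pat <- "[^aeiouAEIOU ]{3}"

# Keep only the rows whose name contains the pattern
kept <- animals %>% filter(str_detect(name, pat))
kept$name
## [1] "lynx"       "sphynx cat"

# Negate the condition to drop those rows instead
dropped <- animals %>% filter(!str_detect(name, pat))
dropped$name
## [1] "otter" "emu"
```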
Locate patterns with str_locate
str_locate(example, pat) # starting position and ending position of first match
##      start end
## [1,]    23  25
Let’s check the answer:
str_sub(example, 23, 25)
## [1] "mps"
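str_locate reports only the first match. When every match is needed, str_locate_all returns a list with one matrix of positions per input string. A small sketch, using a made-up string chosen so that the pattern matches several times:

```r
library(stringr)

# Hypothetical string, invented so the pattern matches more than once
words <- "strengths and rhythms"
pat <- "[^aeiouAEIOU ]{3}"

# One row per match, with columns "start" and "end"
all_hits <- str_locate_all(words, pat)[[1]]
all_hits

# Check the positions by pulling out each located substring
found <- str_sub(words, all_hits[, "start"], all_hits[, "end"])
found
## [1] "str" "ngt" "rhy" "thm"
```

Matches do not overlap: once "ngt" is consumed, the search resumes at the following character.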
Extract patterns with str_extract and str_extract_all
pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant
str_extract(example2, pat2) # extract first match
## [1] "road"
str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
##      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
## [1,] "road" "wood" "coul" "tood" "look" "coul"
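str_extract is also vectorized: given several strings, it returns one result per string, with NA where nothing matches. A brief sketch with made-up labels:

```r
library(stringr)

# Hypothetical labels, invented for this illustration
labels <- c("Room 204", "Lobby", "Room 17B")

# First run of digits in each string; NA when there are no digits
room_digits <- str_extract(labels, "[0-9]+")
room_digits
## [1] "204" NA    "17"
```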
Convert a string to all lower case letters with str_to_lower
str_to_lower(example2)
## [1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
Split strings with separate
df <- tibble(ex = example2)
df <- separate(df, ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df$line1
## [1] "Two roads diverged in a yellow wood,"
df$line2
## [1] "And sorry I could not travel both"
df$line3
## [1] "And be one traveler, long I stood"
df$line4
## [1] "And looked down one as far as I could"
Note: The function separate is in the tidyr package.
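When sep is a character string, separate interprets it as a regular expression, so a single call can cope with inconsistent delimiters. The sketch below uses a made-up flights tibble in which the arrow is sometimes surrounded by spaces and sometimes not.

```r
library(tidyr)
library(tibble)

# Hypothetical data, invented for this illustration
flights <- tibble(route = c("MSP -> ORD", "LAX->JFK"))

# "\\s*->\\s*" matches the arrow plus any surrounding whitespace
legs <- separate(flights, route, into = c("from", "to"), sep = "\\s*->\\s*")
legs$from
## [1] "MSP" "LAX"
legs$to
## [1] "ORD" "JFK"
```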
Homework 9: due Wednesday, April 5th @ 11:59pm
The tibble courses has the Spring 2023 enrollment information from the Macalester Registrar’s website, which we can obtain with web scraping tools.
library(tidyverse) # provides tibble(), write_csv(), and %>%
library(purrr)
library(rvest)
library(lubridate)

# Read the class schedule page
spring2023 <- read_html("https://www.macalester.edu/registrar/schedules/2022fall/class-schedule")

# Retrieve and inspect course numbers
course_nums <-
  spring2023 %>%
  html_nodes(".class-schedule-course-number") %>%
  html_text()

# Retrieve and inspect course names
course_names <-
  spring2023 %>%
  html_nodes(".class-schedule-course-title") %>%
  html_text()

# Course number: everything except the last 6 characters
course_nums_clean <- stringr::str_sub(course_nums, end = nchar(course_nums) - 6)

# CRN: the last 5 characters
crn <- stringr::str_sub(course_nums, start = nchar(course_nums) - 4)

# Instructors: trim whitespace, then drop the first 12 characters
course_instructors <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(6)") %>%
  html_text()
course_instructors_short <- stringr::str_sub(trimws(course_instructors), start = 13)

# Days: trim whitespace, drop the first 6 characters, then trim again
course_days <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(3)") %>%
  html_text()
course_days_short <- trimws(stringr::str_sub(trimws(course_days), start = 7))

# Times: trim whitespace, then drop the first 6 characters
course_times <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(4)") %>%
  html_text()
course_times_short <- stringr::str_sub(trimws(course_times), start = 7)

# Rooms: trim whitespace, then drop the first 6 characters
course_rooms <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(5)") %>%
  html_text()
course_rooms_short <- stringr::str_sub(trimws(course_rooms), start = 7)

# Availability: trim whitespace, then drop the first 13 characters
course_avail <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(7)") %>%
  html_text()
course_avail_short <- stringr::str_sub(trimws(course_avail), start = 14)

# Assemble the pieces into one table and save it
courses <-
  tibble(
    number = course_nums_clean,
    crn = crn,
    name = course_names,
    days = course_days_short,
    time = course_times_short,
    room = course_rooms_short,
    instructor = course_instructors_short,
    avail_max = course_avail_short
  )
write_csv(courses, "courses.csv")
Exercise 14.1 (Rearrange data table) Make the following changes to the courses data table (save updated table as courses):
- Split number into three separate columns: dept, number, and section.
- Split the avail_max variable into two separate variables: avail and max. It might be helpful to first remove all appearances of “Closed”.
- Use avail and max to generate a new variable called enrollment.
- Split the time variable into two separate columns: start_time and end_time. Convert all of these times into continuous 24 hour times (e.g., 2:15 pm should become 14.25). Hint: check out the function lubridate::parse_date_time() to convert to a date-time object. Use some functions in lubridate to extract the hours and minutes. Then, convert the columns using some mathematical functions.
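To illustrate the hint on a toy input (not the courses table itself), the sketch below parses two made-up clock strings and converts them to decimal hours. The orders string "I:M p" (12-hour clock, minutes, AM/PM marker) is one possible choice and assumes an English-language locale for the am/pm marker.

```r
library(lubridate)

# Hypothetical clock strings, invented for this illustration
clock <- c("2:15 pm", "9:40 am")

# Parse as 12-hour time (I), minutes (M), and AM/PM marker (p)
parsed <- parse_date_time(clock, orders = "I:M p")

# Hours plus minutes expressed as a fraction of an hour
decimal_hours <- hour(parsed) + minute(parsed) / 60
decimal_hours
```

Under these assumptions, the first value should come out as 14.25, matching the example in the exercise.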
Exercise 14.2 (Filter cases) Remove all of the following from the data table (save updated table as courses2):
- All sections in PE or INTD.
- All music ensembles and dance practicum sections (these are all of the MUSI and THDA classes with numbers less than 100).
- All lab sections. (This one is a bit tricky. You can’t simply eliminate all instances of “Lab” in the name, because that would eliminate courses like “Economic Analysis of Labor”. My suggestion: filter out rows where section contains “L” and name contains “Lab”.)
Exercise 14.3 (Handle cross-listed courses) Some sections are listed under multiple different departments, and you will find the same instructor, time, enrollment data, etc. For this activity, we only want to include each actual section once and it doesn’t really matter which department code we associate with this section. Eliminate all duplicated cross-listed courses from courses2, keeping each actual section just once (save updated table as courses2). Hint: look into the R command distinct, and think carefully about how to find duplicates. Be sure to look into the .keep_all argument.
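As a small illustration of what .keep_all does, the sketch below applies distinct to a made-up sections tibble (the department codes and instructor names are invented). Rows are considered duplicates when they agree on the listed columns, and the first such row is kept. Which columns identify a duplicate in courses2 is left for you to decide.

```r
library(dplyr)
library(tibble)

# Hypothetical data, invented for this illustration
sections <- tibble(
  dept = c("AAA", "BBB", "CCC"),
  instructor = c("Ng", "Ng", "Diaz"),
  time = c("9:40 am", "9:40 am", "1:10 pm")
)

# Without .keep_all, only the columns named in distinct() are returned
narrow <- distinct(sections, instructor, time)
names(narrow)
## [1] "instructor" "time"

# With .keep_all = TRUE, every column of the first matching row is kept
wide <- distinct(sections, instructor, time, .keep_all = TRUE)
wide$dept
## [1] "AAA" "CCC"
```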
Exercise 14.4 (Faculty enrollments) Using courses2, make a table where each row contains a faculty member, the number of sections they are teaching in Spring 2023, and the total enrollments in those sections. Sort the table from highest total enrollments to lowest, and display the top 6.
For the purposes of this exercise, we are just going to leave co-taught courses as is so that you will have an extra row for each pair or triplet of instructors. Alternatives would be to allocate the enrollment number to each of the faculty members or to split it up between the members. The first option would usually be the most appropriate, although it might depend on the course.
Exercise 14.5 (Evening courses) Create and display a new table with all night courses (i.e., a subset of courses2). Also make a bar plot showing the number of these courses by day of the week.