Topic 14 Regular Expressions

Learning Goals

  • Develop comfort in working with strings of text data
  • Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr and tidyr packages.

Create a new Rmd file (save it as 14-Regex.Rmd). Put this file in a folder Assignment_09 in your COMP_STAT_112 folder.

Regular Expressions and Character Strings

Regular expressions allow us to describe character patterns. Regular expressions allow us to:

  • Search for particular items within a large body of text. For example, you may wish to identify and extract all email addresses.
  • Replace particular items. For example, you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents.
  • Validate input. For example, you may want to check that a password meets certain criteria, such as a mix of uppercase and lowercase letters, digits, and punctuation.
  • Coordinate actions. For example, you may wish to process certain files in a directory, but only if they meet particular conditions.
  • Reformat text. For example, you may want to split strings into different parts, each to form new variables.
  • and more…
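
As a small illustration of the first use case, here is a minimal sketch in R using the stringr package (covered later in this topic). The sample text and the email pattern are made up for illustration, and the pattern is deliberately simplified, so it will not cover every valid address.

```r
library(stringr)

# Hypothetical text containing two email addresses
text <- "Contact ann@example.com or bob.smith@mail.example.org for details."

# Simplified pattern: local part, "@", domain, a literal dot, top-level domain
email_pat <- "[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"

# str_extract_all() returns a list with one element per input string
emails <- str_extract_all(text, email_pat)[[1]]
emails
```

The same pattern could be passed to str_detect() to validate input or to str_replace_all() to mask the addresses.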

Start by doing this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R. Some of the syntax in the tutorial is slightly different from what we’ll use in R, but it will still help you get acclimated to the main ideas of regular expressions.

Wrangling with Regular Expressions in R

Now that we have some idea how regular expressions work, let’s examine how to use them to achieve various tasks in R. It will be helpful to have your cheat sheet handy. Many of these tasks can either be accomplished with functions from the base (built-in) package in R or from the stringr package, which is part of the Tidyverse. In general, the stringr functions are faster, which will be noticeable when processing a large amount of text data.

library(tidyverse) # loads stringr, tidyr, tibble, and readr, which the code below uses

example <- "The quick brown fox jumps over the lazy dog."
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
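
To make the comparison between base R and stringr concrete, the sketch below pairs a few base functions with their stringr counterparts. The variable names are chosen for illustration only; the point is that each pair returns the same result for these simple patterns.

```r
library(stringr)

sentence <- "The quick brown fox jumps over the lazy dog."

# Replace the first match: base sub() vs. stringr str_replace()
base_first <- sub("quick", "really quick", sentence)
stringr_first <- str_replace(sentence, "quick", "really quick")

# Replace all matches: base gsub() vs. stringr str_replace_all()
base_all <- gsub("[Tt]he", "a", sentence)
stringr_all <- str_replace_all(sentence, "[Tt]he", "a")

# Detect a pattern: base grepl() vs. stringr str_detect()
base_detect <- grepl("fox", sentence)
stringr_detect <- str_detect(sentence, "fox")
```

Note the argument order: the base functions take the pattern first, while the stringr functions take the string first, which makes them easier to use in a pipe.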

Search and replace patterns with str_replace or str_replace_all (stringr)

To search for a pattern and replace it, we can use the functions str_replace and str_replace_all in the stringr package. Note that str_replace replaces only the first match, while str_replace_all replaces every match. Here are some examples:

str_replace(example, "quick", "really quick")
## [1] "The really quick brown fox jumps over the lazy dog."
str_replace_all(example, "(fox|dog)", "****") # | reads as OR
## [1] "The quick brown **** jumps over the lazy ****."
str_replace_all(example, "(fox|dog).", "****") # "." for any character
## [1] "The quick brown ****jumps over the lazy ****"
str_replace_all(example, "(fox|dog)\\.$", "****") # at end of sentence only, "\\." only for a period
## [1] "The quick brown fox jumps over the lazy ****"
str_replace_all(example, "the", "a") # case-sensitive, so only the lowercase "the" matches
## [1] "The quick brown fox jumps over a lazy dog."
str_replace_all(example, "[Tt]he", "a") # [Tt] matches either t or T; the replacement "a" could also be made conditional on the capitalization
## [1] "a quick brown fox jumps over a lazy dog."
str_replace(example, "[Tt]he", "a") # first match only
## [1] "a quick brown fox jumps over the lazy dog."
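
One of the comments above mentions making the replacement depend on capitalization. A minimal sketch of two ways to get finer control: a named vector of pattern and replacement pairs, and backreferences to capture groups. The variable names are illustrative.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# A named vector maps each pattern to its own replacement,
# so the capitalization of the article can be preserved
articles <- str_replace_all(example, c("The" = "A", "the" = "a"))
articles

# Parentheses create capture groups; \\1 and \\2 refer back to them
swapped <- str_replace(example, "(quick) (brown)", "\\2 \\1")
swapped
```

Here articles keeps a capital letter at the start of the sentence, and swapped reverses the order of the two captured words.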

Detect patterns with str_detect (stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
examples <- c(example, example2, example3)

pat <- "[^aeiouAEIOU ]{3}" # Three consecutive characters that are neither vowels nor spaces: usually consonants, though digits and punctuation would also match

str_detect(examples, pat) # TRUE/FALSE if it detects pattern
## [1]  TRUE  TRUE FALSE
str_subset(examples, pat) # Returns only the strings that contain the pattern
## [1] "The quick brown fox jumps over the lazy dog."                                                                                                        
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
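
A few related helpers are often useful alongside str_detect. The sketch below uses the same three example strings with a simpler pattern, "And", to show the positions of matching strings, the number of matches per string, and the negated test.

```r
library(stringr)

examples <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could",
  "This is a test"
)

has_and <- str_detect(examples, "And")                # TRUE/FALSE per string
which_and <- str_which(examples, "And")               # indices of matching strings
count_and <- str_count(examples, "And")               # number of matches per string
no_and <- str_detect(examples, "And", negate = TRUE)  # TRUE where there is no match
```

Because str_detect returns a logical vector, it can be used directly inside dplyr::filter() to keep or drop rows of a data table.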

Locate patterns with str_locate

str_locate(example, pat) # starting position and ending position of first match
##      start end
## [1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)
## [1] "mps"
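
str_locate reports only the first match. To find every match, a sketch using str_locate_all, which returns a list with one matrix of start and end positions per input string:

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# One row per match of the letter "o"
o_positions <- str_locate_all(example, "o")[[1]]
o_starts <- o_positions[, "start"]
o_starts

# Check the answer by pulling out each located substring
o_found <- str_sub(example, o_positions[, "start"], o_positions[, "end"])
o_found
```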

Extract patterns with str_extract and str_extract_all

pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant
str_extract(example2, pat2) # extract first match
## [1] "road"
str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
##      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## [1,] "road" "wood" "coul" "tood" "look" "coul"
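
When a pattern contains capture groups, str_match returns the pieces as well as the whole match. A minimal sketch, assuming a made-up sentence that contains a date in year-month-day form:

```r
library(stringr)

# Hypothetical input string
note <- "Homework is due 2023-04-05 at midnight."

# Three capture groups: year, month, day
date_pat <- "(\\d{4})-(\\d{2})-(\\d{2})"

whole <- str_extract(note, date_pat)  # the full match only
parts <- str_match(note, date_pat)    # matrix: full match, then one column per group

year <- parts[1, 2]
month <- parts[1, 3]
day <- parts[1, 4]
```

The first column of parts holds the same value as whole; the remaining columns hold the captured groups in order.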

Count the number of characters with str_length

str_length(example2)
## [1] 148

Convert a string to all lower case letters with str_to_lower

str_to_lower(example2)
## [1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
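
Lowercasing is one way to make matching case-insensitive. Another is to wrap the pattern in stringr's regex() helper with ignore_case = TRUE, which leaves the original string untouched. A short sketch comparing the two:

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# Case-sensitive: only the lowercase "the" is counted
n_sensitive <- str_count(example, "the")

# Option 1: lowercase the string first, then match
n_lowered <- str_count(str_to_lower(example), "the")

# Option 2: keep the string as is and ignore case in the pattern
n_ignore <- str_count(example, regex("the", ignore_case = TRUE))
```

Both case-insensitive options count "The" and "the", while the case-sensitive version counts only one of them.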

Split strings with separate

df <- tibble(ex = example2)
df <- separate(df, ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df$line1
## [1] "Two roads diverged in a yellow wood,"
df$line2
## [1] "And sorry I could not travel both"
df$line3
## [1] "And be one traveler, long I stood"
df$line4
## [1] "And looked down one as far as I could"

Note: The function separate is in the tidyr package.
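
The sep argument of separate is itself interpreted as a regular expression, so one call can handle inconsistent delimiters. A minimal sketch on a made-up table (the column names and values are hypothetical), using convert = TRUE so that the numeric pieces become integers:

```r
library(tibble)
library(tidyr)

# Hypothetical codes that use either "-" or "_" as the delimiter
toy <- tibble(code = c("a-1", "b_2", "c-3"))

# "[-_]" matches a hyphen or an underscore
toy_split <- separate(toy, code, c("letter", "number"), sep = "[-_]", convert = TRUE)
toy_split
```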

Homework 9: due Wednesday, April 5th @ 11:59pm

The tibble courses has the Spring 2023 enrollment information from the Macalester Registrar’s website, which we can obtain with web scraping tools.

library(purrr)
library(rvest)
library(lubridate)

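# Note: the URL path below refers to the 2022fall schedule, although the object is named spring2023;
# check that the term in the URL is the one you want
spring2023 <- read_html("https://www.macalester.edu/registrar/schedules/2022fall/class-schedule")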

# Retrieve and inspect course numbers
course_nums <-
  spring2023 %>%
  html_nodes(".class-schedule-course-number") %>%
  html_text()

# Retrieve and inspect course names
course_names <-
  spring2023 %>%
  html_nodes(".class-schedule-course-title") %>%
  html_text()

course_nums_clean <- stringr::str_sub(course_nums, end = nchar(course_nums) - 6)

crn <- stringr::str_sub(course_nums, start = nchar(course_nums) - 4)

course_instructors <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(6)") %>%
  html_text()
course_instructors_short <- stringr::str_sub(trimws(course_instructors), start = 13)

course_days <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(3)") %>%
  html_text()
course_days_short <- trimws(stringr::str_sub(trimws(course_days), start = 7))

course_times <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(4)") %>%
  html_text()
course_times_short <- stringr::str_sub(trimws(course_times), start = 7)

course_rooms <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(5)") %>%
  html_text()
course_rooms_short <- stringr::str_sub(trimws(course_rooms), start = 7)

course_avail <-
  spring2023 %>%
  html_nodes(".class-schedule-label:nth-child(7)") %>%
  html_text()
course_avail_short <- stringr::str_sub(trimws(course_avail), start = 14)


courses <-
  tibble(
    number = course_nums_clean,
    crn = crn,
    name = course_names,
    days = course_days_short,
    time = course_times_short,
    room = course_rooms_short,
    instructor = course_instructors_short,
    avail_max = course_avail_short
  )

write_csv(courses, "courses.csv")

Exercise 14.1 (Rearrange data table) Make the following changes to the courses data table (save updated table as courses):

  1. Split number into three separate columns: dept, number, and section.
  2. Split the avail_max variable into two separate variables: avail and max. It might be helpful to first remove all appearances of “Closed”.
  3. Use avail and max to generate a new variable called enrollment.
  4. Split the time variable into two separate columns: start_time and end_time. Convert all of these times into decimal 24-hour times (e.g., 2:15 pm should become 14.25). Hint: check out the function lubridate::parse_date_time() to convert to a date-time object. Use functions in lubridate to extract the hours and minutes, then combine them arithmetically into a single decimal number.
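
To illustrate the hint in item 4, here is a minimal sketch for a single made-up time string. It assumes the time is written like "2:15 pm"; the format string and variable names are illustrative, and the actual column may need cleaning before it parses.

```r
library(lubridate)

# Hypothetical time string in 12-hour format
time_string <- "2:15 pm"

# %I = hour on a 12-hour clock, %M = minutes, %p = am/pm indicator
parsed <- parse_date_time(time_string, "%I:%M %p")

# Combine hours and minutes into a decimal 24-hour time
decimal_time <- hour(parsed) + minute(parsed) / 60
decimal_time
```

The same steps can be applied to a whole column inside mutate(), since these lubridate functions are vectorized.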

Exercise 14.2 (Filter cases) Remove all of the following from the data table (save updated table as courses2):

  1. All sections in PE or INTD.
  2. All music ensembles and dance practicum sections (these are all of the MUSI and THDA classes with numbers less than 100).
  3. All lab sections. (This one is a bit tricky. You can’t simply eliminate every row with “Lab” in the name, because that would also eliminate courses like “Economic Analysis of Labor”. My suggestion: filter out rows where the section contains “L” and the name contains “Lab”.)

Exercise 14.3 (Handle cross-listed courses) Some sections are listed under multiple different departments, and you will find the same instructor, time, enrollment data, etc. For this activity, we only want to include each actual section once and it doesn’t really matter which department code we associate with this section. Eliminate all duplicated cross-listed courses from courses2, keeping each actual section just once (save updated table as courses2). Hint: look into the R command distinct, and think carefully about how to find duplicates. Be sure to look into the .keep_all argument.
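
To show how distinct behaves with .keep_all, here is a small sketch on a made-up table; the departments, instructors, and times are hypothetical, and which columns identify a duplicate in the real data is for you to decide.

```r
library(dplyr)
library(tibble)

# Hypothetical sections: the first two rows are the same class, cross-listed
toy_sections <- tibble(
  dept = c("COMP", "STAT", "MATH"),
  instructor = c("A. Person", "A. Person", "B. Person"),
  time = c("9:40 am", "9:40 am", "1:10 pm")
)

# Rows are duplicates when instructor and time both match;
# .keep_all = TRUE keeps the other columns from the first row of each group
deduped <- distinct(toy_sections, instructor, time, .keep_all = TRUE)
deduped
```

Without .keep_all = TRUE, the result would contain only the instructor and time columns.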

Exercise 14.4 (Faculty enrollments) Using courses2, make a table where each row contains a faculty member, the number of sections they are teaching in Spring 2023, and the total enrollment in those sections. Sort the table from highest total enrollment to lowest, and display the top 6.

For the purposes of this exercise, we are just going to leave co-taught courses as is so that you will have an extra row for each pair or triplet of instructors. Alternatives would be to allocate the enrollment number to each of the faculty members or to split it up between the members. The first option would usually be the most appropriate, although it might depend on the course.

Exercise 14.5 (Evening courses) Create and display a new table with all evening courses (i.e., a subset of courses2). Also make a bar plot showing the number of these courses by day of the week.