Regular Expressions/Cleaning Text Data

James Normington

Announcements

Regular Expressions

Regular expressions allow us to describe character patterns.

After class, try: Interactive Regex Tutorial

Text Examples

library(tidyverse) # loads stringr, tibble, and tidyr, which the examples below use
(example <- "The quick brown fox jumps over the lazy dog.")
[1] "The quick brown fox jumps over the lazy dog."



We’ll practice:

  • Replacing text patterns
  • Detecting text patterns
  • Locating text patterns
  • Extracting text patterns
  • Counting characters
  • Changing case
  • Splitting text

Search and replace patterns

To search for a pattern and replace it, use str_replace (replaces only the first match) or str_replace_all (replaces every match).


example
[1] "The quick brown fox jumps over the lazy dog."
str_replace(example, pattern = "quick", replacement = "really quick")
[1] "The really quick brown fox jumps over the lazy dog."
str_replace_all(example, pattern = "(fox|dog)",  replacement = "****") 
[1] "The quick brown **** jumps over the lazy ****."
str_replace_all(example, "(fox|dog).", "****") # "." matches any single character: here the space after "fox" and the period after "dog"
[1] "The quick brown ****jumps over the lazy ****"
str_replace_all(example, "(fox|dog)\\.$", "****") # "\\." matches a literal period; "$" anchors the match to the end of the string
[1] "The quick brown fox jumps over the lazy ****"
str_replace(example, "[Tt]he", "a") # only first match
[1] "a quick brown fox jumps over the lazy dog."
str_replace_all(example, "[Tt]he", "a") # all matches
[1] "a quick brown fox jumps over a lazy dog."
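
The replacements above depend on patterns being read as regular expressions, which is why "." has to be escaped to mean a period. A minimal sketch of the alternative, assuming a made-up string `prices` that is not from the slides: stringr's fixed() wrapper treats the pattern as literal text.

```r
library(stringr)

# Hypothetical input, chosen only to illustrate the point
prices <- "1.5 + 2.5"

# As a regex, "." matches every character, so everything is replaced
as_regex <- str_replace_all(prices, ".", ",")

# fixed() matches the literal period only
as_fixed <- str_replace_all(prices, fixed("."), ",")

# Escaping the period gives the same result as fixed()
as_escaped <- str_replace_all(prices, "\\.", ",")

as_regex
as_fixed
as_escaped
```

fixed() is probably the simpler choice when the text to replace contains no pattern logic at all.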

Detect patterns

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
(examples <- c(example, example2, example3))
[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
[3] "This is a test"                                                                                                                                      
pat <- "[^aeiouAEIOU ]{3}" # Three consecutive characters that are neither vowels nor spaces: usually consonants, though punctuation and digits also match
str_detect(examples, pat) 
[1]  TRUE  TRUE FALSE
str_subset(examples, pat)
[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
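
Because the pattern above excludes vowels and spaces rather than listing consonants, it can also match punctuation. The sketch below compares it with a stricter pattern; `strict_pat` and the input `tricky` are assumptions made for illustration, with y counted as a consonant.

```r
library(stringr)

# Pattern from the slides: any three characters that are not vowels or spaces
pat <- "[^aeiouAEIOU ]{3}"

# Assumed stricter alternative: three consonant letters in a row,
# matched case-insensitively via regex(ignore_case = TRUE)
strict_pat <- regex("[b-df-hj-np-tv-z]{3}", ignore_case = TRUE)

# Hypothetical input: one consonant followed by periods
tricky <- "Wait..."

loose_result  <- str_detect(tricky, pat)         # "t.." satisfies the loose pattern
strict_result <- str_detect(tricky, strict_pat)  # no three consonants in a row

loose_result
strict_result
```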

Locate patterns

example
[1] "The quick brown fox jumps over the lazy dog."
str_locate(example, pat) # starting position and ending position of first match
     start end
[1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)
[1] "mps"
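
str_locate reports only the first match. To see every match, one option is str_locate_all, sketched here with the "[Tt]he" pattern from the replacement examples; the variable names `locs` and `matches` are illustrative.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# str_locate_all() returns a list with one matrix of positions per input string
locs <- str_locate_all(example, "[Tt]he")[[1]]
locs

# str_sub() is vectorised over start and end, so each match can be pulled out
matches <- str_sub(example, locs[, "start"], locs[, "end"])
matches
```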

Extract patterns

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # a non-vowel, non-space character, then two vowels, then another non-vowel, non-space character (the trailing {1} is redundant but harmless)

str_extract(example2, pat2) # extract first match
[1] "road"
str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
[1,] "road" "wood" "coul" "tood" "look" "coul"
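
The matches above are four-character fragments such as "coul" rather than whole words. One possible way to capture the full word, shown as a sketch on the first line of the poem only: surround the double vowel with `\\w*`. The pattern `word_pat` is an assumption for illustration, not part of the slides.

```r
library(stringr)

line <- "Two roads diverged in a yellow wood"

# \\w* on either side extends the match to the rest of the word
word_pat <- "\\w*[aeiouAEIOU]{2}\\w*"

# Without simplify = TRUE, str_extract_all() returns a list; take its first element
words <- str_extract_all(line, word_pat)[[1]]
words
```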

Count the number of characters

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
str_length(example2)
[1] 148
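
str_length is vectorised, and counting pattern matches rather than characters is a separate job for str_count. A short sketch using two of the example strings; the names `lengths` and `the_counts` are illustrative.

```r
library(stringr)

examples <- c(
  "The quick brown fox jumps over the lazy dog.",
  "This is a test"
)

# str_length() returns one character count per string
lengths <- str_length(examples)
lengths

# str_count() counts how many times a pattern matches in each string
the_counts <- str_count(examples, "[Tt]he")
the_counts
```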

Change case

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
str_to_lower(example2)
[1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
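
stringr has matching helpers for other cases. The sketch below shows str_to_upper and str_to_title, then lower-cases a string before detection as one simple way to ignore case.

```r
library(stringr)

example3 <- "This is a test"

upper <- str_to_upper(example3)  # every letter upper case
title <- str_to_title(example3)  # first letter of each word upper case

# Lower-casing first makes the comparison case-insensitive
has_this <- str_detect(str_to_lower(example3), "this")

upper
title
has_this
```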

Split strings

df <- tibble(ex = example2)
df <- df %>% separate(ex, c("line1", "line2", "line3", "line4"), sep = " / ") # sep is interpreted as a regular expression
df
# A tibble: 1 × 4
  line1                                line2                         line3 line4
  <chr>                                <chr>                         <chr> <chr>
1 Two roads diverged in a yellow wood, And sorry I could not travel… And … And …
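
When the text is a plain character vector rather than a data frame column, str_split may be the more direct tool. A minimal sketch on the same poem; the variable name `poem_lines` is illustrative.

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# str_split() returns a list with one character vector per input string
poem_lines <- str_split(example2, " / ")[[1]]

length(poem_lines)
poem_lines[1]
poem_lines[4]
```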

Practice

Go to our course website and download the Rmd template file.

After Class

Regular Expressions

Other Assignments

  • Iterative Viz
  • Continue thinking about Final Project