class: left, top, title-slide

.title[
# Textual Data
Regular Expressions
]

.author[
### Keith VanderLinden
Calvin University
]

---

# Regular Expressions: Matches

```r
fruit <- c("apple", "banana.", "pear ")
```

.pull-left[
**Basic Matches**

```r
str_view(fruit, "an")
```

```
## [2] │ b<an><an>a.
```

```r
str_view(fruit, "a\\.")
```

```
## [2] │ banan<a.>
```
]

.pull-right[
**Pattern Matches**

```r
str_view(fruit, "a.")
```

```
## [1] │ <ap>ple
## [2] │ b<an><an><a.>
## [3] │ pe<ar>
```

```r
str_view(fruit, ".r\\ ")
```

```
## [3] │ pe<ar >
```
]

???

- `\.` *escapes* the period so that the regex matches a literal `.`; inside an R string the backslash itself must be written as `\\`. This is a double-escape of sorts.

---

# Regular Expressions: Anchors

```r
fruit <- c("apple", "banana", "pear.")
```

.pull-left[
**Begin String**

```r
str_view(fruit, "a")
```

```
## [1] │ <a>pple
## [2] │ b<a>n<a>n<a>
## [3] │ pe<a>r.
```

```r
str_view(fruit, "^a")
```

```
## [1] │ <a>pple
```
]

.pull-right[
**End String**

```r
str_view(fruit, "a")
```

```
## [1] │ <a>pple
## [2] │ b<a>n<a>n<a>
## [3] │ pe<a>r.
```

```r
str_view(fruit, "a$")
```

```
## [2] │ banan<a>
```
]

???

---

# Regular Expressions: Character Classes

```r
codes <- c("a1", "a 1", "12")
```

.pull-left[
**Digits & White Space**

```r
str_view(codes, "\\d")
```

```
## [1] │ a<1>
## [2] │ a <1>
## [3] │ <1><2>
```

```r
str_view(codes, "a\\s")
```

```
## [2] │ <a >1
```
]

.pull-right[
**Characters**

```r
str_view(codes, "[a-z]")
```

```
## [1] │ <a>1
## [2] │ <a> 1
```

```r
str_view(codes, "[^a]\\d")
```

```
## [2] │ a< 1>
## [3] │ <12>
```
]

???

White space includes spaces, tabs, & newlines.
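
The white-space note can be checked directly. This is a minimal sketch, assuming `stringr` is loaded as on the slides; the `ws` vector and the result names are made up for illustration and are not part of the deck.

```r
library(stringr)

# Hypothetical examples: one string per kind of white space
# (a space, a tab, and a newline) between "a" and "1".
ws <- c("a 1", "a\t1", "a\n1")

# "\\s" is the R string for the regex \s, which should match all three.
has_ws <- str_detect(ws, "a\\s1")

# The slide's own vector for contrast: only "a 1" contains white space.
codes <- c("a1", "a 1", "12")
codes_ws <- str_detect(codes, "\\s")
```

Unlike `str_view()`, `str_detect()` returns one logical value per string, which is easier to check at a glance.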
---

# Regular Expressions: Repetition

```r
s <- "MDCCCLXXXVIII"
```

.pull-left[
```r
str_view(s, "DC?")
```

```
## [1] │ M<DC>CCLXXXVIII
```

```r
str_view(s, "DC+")
```

```
## [1] │ M<DCCC>LXXXVIII
```

```r
str_view(s, "D*")
```

```
## [1] │ <>M<D><>C<>C<>C<>L<>X<>X<>X<>V<>I<>I<>I<>
```
]

.pull-right[
```r
str_view(s, "C{1}")
```

```
## [1] │ MD<C><C><C>LXXXVIII
```

```r
str_view(s, "C{2,}")
```

```
## [1] │ MD<CCC>LXXXVIII
```

```r
str_view(s, "[CD]{2,4}")
```

```
## [1] │ M<DCCC>LXXXVIII
```
]

???

- 1888 is the longest year to date when written in Roman numerals (13 characters): MDCCCLXXXVIII.
- Operators
  - `?`: 0 or 1 matches
  - `+`: 1 or more matches
  - `*`: 0 or more matches (including the empty match, hence the `<>` markers)
  - `{n}`: exactly n matches
  - `{n,}`: n or more matches
  - `{n,m}`: between n and m matches

---

# Summary

Regular expressions provide a language for specifying string patterns.

.footnote[See H. Wickham’s Matching patterns with regular expressions: https://r4ds.had.co.nz/strings.html?q=regular#matching-patterns-with-regular-expressions]

???

- They are critical for text analysis.
- Refer back to the regex used in the NLP slides.
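
As a recap for text-analysis work, the repetition patterns from the earlier slide can be used to pull out matched text rather than just highlight it. This is a minimal sketch, assuming `stringr` is loaded; the result variable names are illustrative only.

```r
library(stringr)

s <- "MDCCCLXXXVIII"  # 1888, as on the Repetition slide

# str_extract() returns the first match as text instead of highlighting it.
optional_c  <- str_extract(s, "DC?")        # D followed by at most one C
one_or_more <- str_extract(s, "DC+")        # D followed by one or more C
bounded     <- str_extract(s, "[CD]{2,4}")  # 2 to 4 characters, each C or D

# str_count() counts the non-overlapping matches of a pattern.
x_count <- str_count(s, "X")
```

The quantifiers are greedy, so `DC+` and `[CD]{2,4}` both take as many characters as they can.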