4 Working with tidy data

These are the course notes for the 2023 version of Fundamentals of Data Science
(MA7419 / MA3419)

4.1 Overview

This week we’ll be examining the concept of tidy data and looking at how to pivot between different data formats.

We’ll also be looking at an important method of searching and manipulating strings called regular expressions.

By the end of this week you’ll be able to:

Define the concept of tidy data and describe when it’s useful
Convert between narrower (tidy) data frames and wider ones using the functions in tidyr
Describe the purpose and structure of a regular expression
Use regular expressions in conjunction with the stringr package

As a special bonus, you’ll be able to use regular expressions for searching inside Microsoft Office documents.

4.2 Tidy data

Tidy data is data where:

Every column is a variable.
Every row is an observation..
Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.

However, tidy data isn’t always the best solution. In particular humans usually find wider data frames easier to take in and understand, so for presentation and data entry tasks the wider format is often best.

Here’s an example of a small tidy data set:

# A tibble: 6 × 3
  Name     Test  Score
  <chr>   <int>  <dbl>
1 Tiddles     1  0.726
2 Rover       1 -0.926
3 Tiddles     2 -1.75 
4 Rover       2  0.258
5 Tiddles     3  0.757
6 Rover       3 -0.484

And here’s the same data in a non-tidy (wide) format.

wide_df <- 
  tidy_df |> 
    pivot_wider(names_from = Test, 
                names_prefix = "Test_", 
                values_from = Score)

wide_df

# A tibble: 2 × 4
  Name    Test_1 Test_2 Test_3
  <chr>    <dbl>  <dbl>  <dbl>
1 Tiddles  0.726 -1.75   0.757
2 Rover   -0.926  0.258 -0.484

The functions in the tidyverse (notably ggplot) expect data in a tidy format but the wider format is often easier for people to read or to use when entering data.

As you’ve just seen, the pivot_wider() function takes us from tidy (aka “longer”) format to a wider format. We can go back with pivot_longer(). Notice that the parameters are matched to those in the ‘pivot wider()’ example above.

wide_df |> 
  pivot_longer(cols = !Name,              # Apply to all columns except Name
               names_to = "Test",
               names_prefix = "Test_",    # Remove this from the names.
               values_to = "Score",
               )

# A tibble: 6 × 3
  Name    Test   Score
  <chr>   <chr>  <dbl>
1 Tiddles 1      0.726
2 Tiddles 2     -1.75 
3 Tiddles 3      0.757
4 Rover   1     -0.926
5 Rover   2      0.258
6 Rover   3     -0.484

For more details, see the pivot vignette.

Note: pivot_wider() and pivot_longer() replaced the previous functions spread() and gather(), which you may still see around.

4.3 Regular expressions

Regular expressions can be thought of as a mechanism for advanced search. They provide a language for writing a pattern which is used to match occurrences withing a character string. This ability is made use of by functions in many different languages - in R we’ll mainly use them in conjunction with the ‘stringr’ package.

Here’s a simple example. Suppose the string we want to search is “Paul wrote this in 2020.”

The simplest search would be to see if target contains the string “Paul”, like this.

library(stringr)
target <- "Paul wrote this in 2020."
target |> 
  str_detect(pattern = "Paul")

[1] TRUE

Or we could extract the year contained in the target using a pattern that finds any string of exactly 4 digits.

target |> 
  str_extract(pattern = "\\d{4}")

[1] "2020"

You’ll see that a regular expression (or regex) will often contain a special character like \\d, which matches any digit. The double backslash is a quirk of R. Most implementations of regular expressions only require a single backslash, so if you use any resource not written specifically for R you’ll see \d rather than \\d.

In fact (rather like SQL) there are minor differences in the implementation across different languages: something to bear in mind if you get stuck debugging some regex code.

Our treatment in the lecture is by no means comprehensive - regular expressions can be considered a language in their own right. See:

Regular expressions

My favourite quick reference is:

Regex Cheat Sheet

For stringr see:

Introduction to stringr

The university library has a book called Mastering Regular Expressions (Friedl 2006).

4.4 Reading

R for Data Science (Wickham and Grolemund 2017):

Chapter 12 Tidy data

R Programming for Data Science (Peng 2020):

Chapter 13 Control structures
Chapter 14 Functions
Chapter 17 Regular expressions

Check your understanding

In the regular expression “\\d{4}”, why are there two backslashes?

This is a peculiarity of using regular expressions in R. The first backslash tells R to treat the second backslash as a ‘backslash’ and not a special character.

What pattern would “\\D{4}” match?

This would match any four characters that are NOT digits. (See also w vs W - i.e. lower-case w vs upper-case W).

In the world of regular expressions, what does “greedy” mean?

A greedy expression matches the longest possible pattern it can find in the target. A lazy expression takes the first matching pattern it comes across (i.e. the shortest).

For example, here we try extracting 2 or more digits after the decimal point:

# Greedy...
str_extract("The value of pi is 3.14159", "\\d\\.\\d{2,}")

[1] "3.14159"

# Lazy (note the question mark)...
str_extract("The value of pi is 3.14159", "\\d\\.\\d{2,}?")

[1] "3.14"