4 Working with tidy data
(MA7419 / MA3419)
4.1 Overview
This week we’ll be examining the concept of tidy data and looking at how to pivot between different data formats.
We’ll also be looking at an important method of searching and manipulating strings called regular expressions.
By the end of this week you’ll be able to:
- Define the concept of tidy data and describe when it’s useful
- Convert between narrower (tidy) data frames and wider ones using the functions in tidyr
- Describe the purpose and structure of a regular expression
- Use regular expressions in conjunction with the stringr package
As a special bonus, you’ll be able to use regular expressions for searching inside Microsoft Office documents.
4.2 Tidy data
Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.
However, tidy data isn’t always the best solution. In particular, humans usually find wider data frames easier to take in and understand, so for presentation and data entry tasks the wider format is often best.
Here’s an example of a small tidy data set:
# A tibble: 6 × 3
Name Test Score
<chr> <int> <dbl>
1 Tiddles 1 0.726
2 Rover 1 -0.926
3 Tiddles 2 -1.75
4 Rover 2 0.258
5 Tiddles 3 0.757
6 Rover 3 -0.484
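The code that created tidy_df (the name used in the pivot_wider() example) is not shown in these notes. As a minimal sketch, assuming the values printed in the 6 × 3 tibble, it could be rebuilt with tribble() from the tibble package. The scores here are the rounded values as displayed, so they may differ slightly from the original data.

```r
library(tibble)

# Values copied from the printed tibble; scores are rounded as displayed,
# so this is a reconstruction rather than the original data.
tidy_df <- tribble(
  ~Name,     ~Test, ~Score,
  "Tiddles", 1L,     0.726,
  "Rover",   1L,    -0.926,
  "Tiddles", 2L,    -1.75,
  "Rover",   2L,     0.258,
  "Tiddles", 3L,     0.757,
  "Rover",   3L,    -0.484
)

tidy_df
```

Each row is one observation (one pet sitting one test), and each column holds one variable, which is what makes this layout tidy.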
And here’s the same data in a non-tidy (wide) format.
library(tidyr)
wide_df <- tidy_df |>
  pivot_wider(names_from = Test,       # One new column per test
              names_prefix = "Test_",
              values_from = Score)
wide_df
# A tibble: 2 × 4
Name Test_1 Test_2 Test_3
<chr> <dbl> <dbl> <dbl>
1 Tiddles 0.726 -1.75 0.757
2 Rover -0.926 0.258 -0.484
The functions in the tidyverse (notably ggplot2) expect data in a tidy format, but the wider format is often easier for people to read or to use when entering data.
As you’ve just seen, the pivot_wider() function takes us from tidy (aka “longer”) format to a wider format. We can go back with pivot_longer(). Notice that the parameters mirror those in the pivot_wider() example above.
wide_df |>
  pivot_longer(cols = !Name,            # Apply to all columns except Name
               names_to = "Test",
               names_prefix = "Test_",  # Remove this from the names.
               values_to = "Score")
# A tibble: 6 × 3
Name Test Score
<chr> <chr> <dbl>
1 Tiddles 1 0.726
2 Tiddles 2 -1.75
3 Tiddles 3 0.757
4 Rover 1 -0.926
5 Rover 2 0.258
6 Rover 3 -0.484
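Notice that Test has come back as a character column (<chr>), whereas in the tidy data it was an integer, because column names are always strings. One way to restore the type is the names_transform argument of pivot_longer(). The sketch below rebuilds wide_df by hand from the printed values (an assumption, so that the example runs on its own) and uses the hypothetical name long_df for the result.

```r
library(tidyr)
library(tibble)

# wide_df rebuilt from the printed output so the example is self-contained.
wide_df <- tribble(
  ~Name,     ~Test_1, ~Test_2, ~Test_3,
  "Tiddles",  0.726,  -1.75,    0.757,
  "Rover",   -0.926,   0.258,  -0.484
)

long_df <- wide_df |>
  pivot_longer(cols = !Name,
               names_to = "Test",
               names_prefix = "Test_",
               names_transform = list(Test = as.integer),  # "1" -> 1L
               values_to = "Score")

long_df
```

The row order still differs from the original tidy data (it is grouped by Name rather than by Test), but the contents are the same.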
For more details, see the pivot vignette.
Note: pivot_wider() and pivot_longer() replaced the previous functions spread() and gather(), which you may still see around.
4.3 Regular expressions
Regular expressions can be thought of as a mechanism for advanced search. They provide a language for writing a pattern which is used to match occurrences within a character string. Functions in many different languages make use of this ability; in R we’ll mainly use regular expressions in conjunction with the stringr package.
Here’s a simple example. Suppose the string we want to search is “Paul wrote this in 2020.”
If we store that string in a variable called target, the simplest search would be to see whether target contains the string “Paul”, like this.
library(stringr)
target <- "Paul wrote this in 2020."
target |>
  str_detect(pattern = "Paul")
[1] TRUE
Or we could extract the year contained in the target using a pattern that matches four consecutive digits.
target |>
  str_extract(pattern = "\\d{4}")
[1] "2020"
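One caveat is worth a quick illustration: \\d{4} matches the first four digits of any longer run of digits as well. The sketch below uses a made-up string ("Order 123456") to show this, and then adds word boundaries (\\b) so that only a run of exactly four digits is matched.

```r
library(stringr)

# \\d{4} is satisfied by the first four digits of a longer number.
str_extract("Order 123456", "\\d{4}")
# [1] "1234"

# Wrapping the pattern in word boundaries rejects the longer run ...
str_extract("Order 123456", "\\b\\d{4}\\b")
# [1] NA

# ... but still finds a standalone four-digit year.
str_extract("Paul wrote this in 2020.", "\\b\\d{4}\\b")
# [1] "2020"
```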
You’ll see that a regular expression (or regex) will often contain a special sequence like \\d, which matches any digit. The double backslash is a quirk of R: the backslash is also the escape character in R string literals, so it has to be doubled for a single backslash to reach the regex engine. Most implementations of regular expressions only require a single backslash, so if you use any resource not written specifically for R you’ll see \d rather than \\d.
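To see what the doubled backslash is doing, it can help to look at the characters R actually stores. The sketch below also shows raw strings, available from R 4.0.0, which let you write the pattern with a single backslash; whether to use them is a matter of taste.

```r
library(stringr)

target <- "Paul wrote this in 2020."

# The literal "\\d{4}" holds five characters: \ d { 4 }
# R consumes one backslash when it parses the string.
nchar("\\d{4}")
writeLines("\\d{4}")

# A raw string (R >= 4.0.0) holds the same characters without the doubling.
identical(r"(\d{4})", "\\d{4}")

# Either form can be passed as the pattern.
str_extract(target, r"(\d{4})")
```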
In fact (rather like SQL) there are minor differences in the implementation across different languages: something to bear in mind if you get stuck debugging some regex code.
Our treatment in the lecture is by no means comprehensive - regular expressions can be considered a language in their own right. See:
My favourite quick reference is:
For stringr see:
The university library has a book called Mastering Regular Expressions (Friedl 2006).
4.4 Reading
R for Data Science (Wickham and Grolemund 2017):
- Chapter 12 Tidy data
R Programming for Data Science (Peng 2020):
- Chapter 13 Control structures
- Chapter 14 Functions
- Chapter 17 Regular expressions