These are the course notes for the 2023 version of Fundamentals of Data Science (MA7419 / MA3419)
5.1 Overview
This week we’ll be looking at how to combine different tables of data.
By the end of this chapter you’ll be able to:
Joint data frames using dplyr::bind_rows() or dplyr::bind_cols()
Describe and use the following dplyr mutating joins:
left_join()
right_join()
inner_join()
full_join()
Describe and use the following dplyr filtering joins:
semi_join()
anti_join()
5.2 Joining Tables
It’s a common problem to have to combine data contained in two different tables. (We’ll always assume in R that we are referring to data frames when we talk about a table - although there are other table-like structures)
Binding rows
One simple case is when we have two tables with identical columns and we want to “stick” one onto the bottom of the other. In Excel this is often done by cutting and pasting, which is a very dangerous method since it’s so easy to lose rows.
In R we do this using dplyr::bind_rows(). (This is the tidyverse equivalent of rbind().)
library(dplyr)library(stringr)# Split up mtcarsall_cars <- mtcars |>arrange(wt)big_cars <-filter(all_cars, wt >3.5 )little_cars <-filter(all_cars, wt <=3.5)# Recombinecombined_rows <-bind_rows(big_cars, little_cars) |>arrange(wt)# Test for equalityidentical(all_cars, combined_rows)
[1] TRUE
A few things to note:
If one of the data frames has a column that isn’t in the other, it will be filled with NAs in the output;
The columns don’t have to be in the same order, they just have to have the same names;
If there are are duplicated rows in the inputs, they will be duplicated in the output (see union() for a way to avoid this if both data frames have identical columns);
bind_rows() can take any number of inputs - contained in a list;
If you are trying to join many data frames (inside a loop for example) it is much faster to collect them in a list and then bind them all at once, than it is to bind each one inside the loop.
You can use the .id parameter to record the source of each row (see ?bind_rows).
The set functions union(), intersect() and setdif() can also be useful.
Binding columns
If we have the same number of rows in two data frames we can add the columns using bind_cols().
# Split up mtcarscars1 <- all_cars |>select(1:4)cars2 <- all_cars |>select(-(1:4))#Recombinecombined_cols <-bind_cols(cars1, cars2)# Test for equalityidentical(all_cars, combined_cols)
[1] TRUE
5.3 Mutating Joins
The following joins allow us to combine the data from two tables, creating extra columns as necessary.
A nice feature of these `join_ functions is that their names and behaviour are similar to analogous functions for joining data in SQL, as we will see later in the programme.
Suppose I have a table containing some details of students registered on a particular module. Perhaps it contains the student ID and a mark for a particular assignment.
set.seed(123)marks <-tibble(ID =1:5,Score =round(rnorm(5, mean =65, sd =10), 1))marks$ID[5] <-25marks
# A tibble: 10 × 3
ID Level Email
<int> <chr> <chr>
1 1 PGT stu1@univ.ac.uk
2 2 PGT stu2@univ.ac.uk
3 3 PGT <NA>
4 4 UG stu4@univ.ac.uk
5 5 PGT stu5@univ.ac.uk
6 6 UG stu6@univ.ac.uk
7 7 PGT stu7@univ.ac.uk
8 8 UG stu8@univ.ac.uk
9 9 UG stu9@univ.ac.uk
10 10 UG stu10@univ.ac.uk
Now suppose I want to add the email address to the first data frame. In Excel we could use something like vlookup or index/match. In R we can use a left_join() which takes each row in the second table and adds the columns from the second if it finds a match. In my experience this is the most commonly used join.
If you don’t want the Level column it can be filtered out after the join.
right_join() includes all the rows in the second table.
marks |>right_join(details, by ="ID")
# A tibble: 10 × 4
ID Score Level Email
<dbl> <dbl> <chr> <chr>
1 1 59.4 PGT stu1@univ.ac.uk
2 2 62.7 PGT stu2@univ.ac.uk
3 3 80.6 PGT <NA>
4 4 65.7 UG stu4@univ.ac.uk
5 5 NA PGT stu5@univ.ac.uk
6 6 NA UG stu6@univ.ac.uk
7 7 NA PGT stu7@univ.ac.uk
8 8 NA UG stu8@univ.ac.uk
9 9 NA UG stu9@univ.ac.uk
10 10 NA UG stu10@univ.ac.uk
In this case we end up with 10 rows (as in the details table.)
inner_join() only matches rows that occur in both tables.
Finally, full_join() includes all rows in marks OR details.
marks |>full_join(details, by ="ID")
# A tibble: 11 × 4
ID Score Level Email
<dbl> <dbl> <chr> <chr>
1 1 59.4 PGT stu1@univ.ac.uk
2 2 62.7 PGT stu2@univ.ac.uk
3 3 80.6 PGT <NA>
4 4 65.7 UG stu4@univ.ac.uk
5 25 66.3 <NA> <NA>
6 5 NA PGT stu5@univ.ac.uk
7 6 NA UG stu6@univ.ac.uk
8 7 NA PGT stu7@univ.ac.uk
9 8 NA UG stu8@univ.ac.uk
10 9 NA UG stu9@univ.ac.uk
11 10 NA UG stu10@univ.ac.uk
Note that if the second table (details) contains more than one row that matches with ID in marks there will be more than one corresponding row in the output.
Sometimes we want to remove rows from the first table, dependent on the contents of the second. For example we can use the information about student level in details to pick out any students with out a matching entry.
marks |>anti_join(details, by ="ID")
# A tibble: 1 × 2
ID Score
<dbl> <dbl>
1 25 66.3
Or we can use a list of UG students (in this case generated by filtering details) to pick out the UG students in marks.
marks |>semi_join(details |>filter(Level =="UG"), by ="ID")
# A tibble: 1 × 2
ID Score
<dbl> <dbl>
1 4 65.7
5.4 Summary
Mutating joins add columns from y to x, matching rows based on the keys passed with the by = parameter:
inner_join() includes all rows in x and y.
left_join() includes all rows in x.
right_join() includes all rows in y.
full_join() includes all rows in x or y.
If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.
Filtering joins filter rows from x based on the presence or absence of matches in y:
semi_join() returns all rows from x with a match in y.
anti_join() returns all rows from x without a match in y.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. http://r4ds.had.co.nz/.