5 Joining tables

These are the course notes for the 2023 version of Fundamentals of Data Science
(MA7419 / MA3419)

5.1 Overview

This week we’ll be looking at how to combine different tables of data.

By the end of this chapter you’ll be able to:

Joint data frames using dplyr::bind_rows() or dplyr::bind_cols()
Describe and use the following dplyr mutating joins:

left_join()
right_join()
inner_join()
full_join()

Describe and use the following dplyr filtering joins:

semi_join()
anti_join()

5.2 Joining Tables

It’s a common problem to have to combine data contained in two different tables. (We’ll always assume in R that we are referring to data frames when we talk about a table - although there are other table-like structures)

Binding rows

One simple case is when we have two tables with identical columns and we want to “stick” one onto the bottom of the other. In Excel this is often done by cutting and pasting, which is a very dangerous method since it’s so easy to lose rows.

In R we do this using dplyr::bind_rows(). (This is the tidyverse equivalent of rbind().)

library(dplyr)
library(stringr)
# Split up mtcars
all_cars <- mtcars |> arrange(wt)
big_cars <- filter(all_cars, wt > 3.5 )
little_cars <- filter(all_cars, wt <= 3.5)

# Recombine
combined_rows <- bind_rows(big_cars, little_cars) |> arrange(wt)

# Test for equality
identical(all_cars, combined_rows)

[1] TRUE

A few things to note:

If one of the data frames has a column that isn’t in the other, it will be filled with NAs in the output;
The columns don’t have to be in the same order, they just have to have the same names;
If there are are duplicated rows in the inputs, they will be duplicated in the output (see union() for a way to avoid this if both data frames have identical columns);
bind_rows() can take any number of inputs - contained in a list;
If you are trying to join many data frames (inside a loop for example) it is much faster to collect them in a list and then bind them all at once, than it is to bind each one inside the loop.
You can use the .id parameter to record the source of each row (see ?bind_rows).
The set functions union(), intersect() and setdif() can also be useful.

Binding columns

If we have the same number of rows in two data frames we can add the columns using bind_cols().

# Split up mtcars
cars1 <- all_cars |> select(1:4)
cars2 <- all_cars |> select(-(1:4))

#Recombine
combined_cols <- bind_cols(cars1, cars2)

# Test for equality
identical(all_cars, combined_cols)

[1] TRUE

5.3 Mutating Joins

The following joins allow us to combine the data from two tables, creating extra columns as necessary.

A nice feature of these `join_ functions is that their names and behaviour are similar to analogous functions for joining data in SQL, as we will see later in the programme.

Suppose I have a table containing some details of students registered on a particular module. Perhaps it contains the student ID and a mark for a particular assignment.

set.seed(123)
marks <- 
  tibble(ID = 1:5,
         Score = round(rnorm(5, mean = 65, sd = 10), 1))

marks$ID[5] <- 25
marks

# A tibble: 5 × 2
     ID Score
  <dbl> <dbl>
1     1  59.4
2     2  62.7
3     3  80.6
4     4  65.7
5    25  66.3

I also have a table listing students and whether they are UG or PGT and their email address.

details <- 
  tibble( ID = 1:10,
          Level = sample(c("UG", "PGT"), size = 10, replace = TRUE)) |> 
  mutate(Email = str_c("stu", ID, "@univ.ac.uk"))

is.na(details$Email) <- 3

details

# A tibble: 10 × 3
      ID Level Email           
   <int> <chr> <chr>           
 1     1 PGT   stu1@univ.ac.uk 
 2     2 PGT   stu2@univ.ac.uk 
 3     3 PGT   <NA>            
 4     4 UG    stu4@univ.ac.uk 
 5     5 PGT   stu5@univ.ac.uk 
 6     6 UG    stu6@univ.ac.uk 
 7     7 PGT   stu7@univ.ac.uk 
 8     8 UG    stu8@univ.ac.uk 
 9     9 UG    stu9@univ.ac.uk 
10    10 UG    stu10@univ.ac.uk

Now suppose I want to add the email address to the first data frame. In Excel we could use something like vlookup or index/match. In R we can use a left_join() which takes each row in the second table and adds the columns from the second if it finds a match. In my experience this is the most commonly used join.

marks |> 
  left_join(details, by = "ID")

# A tibble: 5 × 4
     ID Score Level Email          
  <dbl> <dbl> <chr> <chr>          
1     1  59.4 PGT   stu1@univ.ac.uk
2     2  62.7 PGT   stu2@univ.ac.uk
3     3  80.6 PGT   <NA>           
4     4  65.7 UG    stu4@univ.ac.uk
5    25  66.3 <NA>  <NA>

If you don’t want the Level column it can be filtered out after the join.

right_join() includes all the rows in the second table.

marks |> 
  right_join(details, by = "ID")

# A tibble: 10 × 4
      ID Score Level Email           
   <dbl> <dbl> <chr> <chr>           
 1     1  59.4 PGT   stu1@univ.ac.uk 
 2     2  62.7 PGT   stu2@univ.ac.uk 
 3     3  80.6 PGT   <NA>            
 4     4  65.7 UG    stu4@univ.ac.uk 
 5     5  NA   PGT   stu5@univ.ac.uk 
 6     6  NA   UG    stu6@univ.ac.uk 
 7     7  NA   PGT   stu7@univ.ac.uk 
 8     8  NA   UG    stu8@univ.ac.uk 
 9     9  NA   UG    stu9@univ.ac.uk 
10    10  NA   UG    stu10@univ.ac.uk

In this case we end up with 10 rows (as in the details table.)

inner_join() only matches rows that occur in both tables.

marks |> 
  inner_join(details, by = "ID")

# A tibble: 4 × 4
     ID Score Level Email          
  <dbl> <dbl> <chr> <chr>          
1     1  59.4 PGT   stu1@univ.ac.uk
2     2  62.7 PGT   stu2@univ.ac.uk
3     3  80.6 PGT   <NA>           
4     4  65.7 UG    stu4@univ.ac.uk

Finally, full_join() includes all rows in marks OR details.

marks |> 
  full_join(details, by = "ID")

# A tibble: 11 × 4
      ID Score Level Email           
   <dbl> <dbl> <chr> <chr>           
 1     1  59.4 PGT   stu1@univ.ac.uk 
 2     2  62.7 PGT   stu2@univ.ac.uk 
 3     3  80.6 PGT   <NA>            
 4     4  65.7 UG    stu4@univ.ac.uk 
 5    25  66.3 <NA>  <NA>            
 6     5  NA   PGT   stu5@univ.ac.uk 
 7     6  NA   UG    stu6@univ.ac.uk 
 8     7  NA   PGT   stu7@univ.ac.uk 
 9     8  NA   UG    stu8@univ.ac.uk 
10     9  NA   UG    stu9@univ.ac.uk 
11    10  NA   UG    stu10@univ.ac.uk

Note that if the second table (details) contains more than one row that matches with ID in marks there will be more than one corresponding row in the output.

details <- 
  details |> 
  bind_rows(tibble(ID = 4, Level = "PGT", Email = "stu4@univ.ac.uk"))

marks |> 
  left_join(details, by = "ID")

# A tibble: 6 × 4
     ID Score Level Email          
  <dbl> <dbl> <chr> <chr>          
1     1  59.4 PGT   stu1@univ.ac.uk
2     2  62.7 PGT   stu2@univ.ac.uk
3     3  80.6 PGT   <NA>           
4     4  65.7 UG    stu4@univ.ac.uk
5     4  65.7 PGT   stu4@univ.ac.uk
6    25  66.3 <NA>  <NA>

Filtering joins

Sometimes we want to remove rows from the first table, dependent on the contents of the second. For example we can use the information about student level in details to pick out any students with out a matching entry.

marks |> 
  anti_join(details, by = "ID")

# A tibble: 1 × 2
     ID Score
  <dbl> <dbl>
1    25  66.3

Or we can use a list of UG students (in this case generated by filtering details) to pick out the UG students in marks.

marks |> 
  semi_join(details |> filter(Level =="UG"), by = "ID")

# A tibble: 1 × 2
     ID Score
  <dbl> <dbl>
1     4  65.7

5.4 Summary

Mutating joins add columns from y to x, matching rows based on the keys passed with the by = parameter:

inner_join() includes all rows in x and y.

left_join() includes all rows in x.

right_join() includes all rows in y.

full_join() includes all rows in x or y.

If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.

Filtering joins filter rows from x based on the presence or absence of matches in y:

semi_join() returns all rows from x with a match in y.

anti_join() returns all rows from x without a match in y.

5.5 Reading

R for Data Science (Wickham and Grolemund 2017):

Chapter 13 Relational data

Check your understanding

What would you expect the output from full_join(big_cars, little_cars) to be?

full_join(big_cars, little_cars)

Joining with `by = join_by(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear,
carb)`

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
3  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
5  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
6  13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
7  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
8  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
9  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
10 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
11 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
12 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
13 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
14 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
15 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
16 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
17 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
18 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
19 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
20 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
21 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
22 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
23 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
24 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
25 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
26 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
27 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
28 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
29 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
30 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
31 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
32 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1