9 APIs

These are the course notes for the 2023 version of Fundamentals of Data Science
(MA7419 / MA3419)

9.1 Overview

This week we’ll be looking at APIs and how to request data from them in R.

9.2 Definitions

API

An application programming interface (API) is an interface or communication protocol between different parts of a computer program intended to simplify the implementation and maintenance of software. (Wikipedia)

REST

Representational state transfer (REST) is a software architectural style that defines a set of constraints to be used for creating Web services. (Wikipedia)

Specifically, one of the RESTful rules is that you should get data (called a resource) returned when you request a specific URL.

The URL is called a request and what is sent back is called a response.

You can use RESTful APIs to send as well as receive data, but we will only look at how to get data.

The API request can be included in a program - so you don’t need a user to click on a download link.

Another piece of jargon is endpoint. This is the base URL for the API. It is followed by a path that points to the exact resource.

Finally, we can have query parameters. These come after the path, always begin with a ? and look like:

?query1=param1&query2=param2

where each pair has the form name=value and the & separates one pair from the next.
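To make the anatomy concrete, here is a minimal sketch that glues the three parts together in base R. The endpoint https://api.example.com, the path /v1/items and the parameter names are placeholders invented for illustration, not a real service.

```r
# Assemble a request URL from endpoint, path and query parameters.
# All three values below are hypothetical placeholders.
endpoint <- "https://api.example.com"
path     <- "/v1/items"
params   <- list(query1 = "param1", query2 = "param2")

# Join each name/value pair with "=", then join the pairs with "&"
query_string <- paste0(names(params), "=", unlist(params), collapse = "&")

# The "?" marks where the path ends and the query parameters begin
request_url <- paste0(endpoint, path, "?", query_string)
request_url
```

This simple version does no URL encoding, so it is only safe for values without spaces or special characters.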

Let’s have an example.

9.3 Example

The endpoint for GitHub is: https://api.github.com

The path to a specific user’s repos is /users/<username>/repos.

Try copying https://api.github.com/users/vivait/repos into your browser…

you should see information returned in JSON.

But we want to access the data in a program, not via a browser.

The package httr provides tools for HTTP, including the verb GET:

library(dplyr)
library(jsonlite)
library(httr)

github_api <- function(path) {
  url <- modify_url("https://api.github.com", path = path)
  GET(url)
}

resp <- github_api("/users/actuarial-science/repos")

We can use jsonlite to parse the content of the response into a useful R object.

repos <- fromJSON(content(resp, "text"))
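Before parsing, it is worth checking that the request actually succeeded. The following is a minimal sketch using the httr helpers http_error() and status_code(); the wrapper name parse_github is made up for illustration and is not part of either package.

```r
library(httr)
library(jsonlite)

# parse_github() is a hypothetical helper, not part of httr or jsonlite
parse_github <- function(resp) {
  # http_error() is TRUE for status codes of 400 and above
  if (http_error(resp)) {
    stop("GitHub API request failed with status ", status_code(resp),
         call. = FALSE)
  }
  # Only parse the body once we know the request succeeded
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
```

Given a response object such as resp, calling parse_github(resp) would either return the parsed data or stop with the status code in the error message.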

We can add some parameters to our query:

resp <- github_api("/users/vivait/repos?sort=updated&per_page=100")
repos <- fromJSON(content(resp, "text"))
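An alternative to pasting the parameters into the path is to hand them to httr as a named list and let it build (and encode) the query string. This sketch relies on the query argument of modify_url(); github_url is a hypothetical helper that only builds the URL and sends nothing.

```r
library(httr)

# github_url() is a hypothetical helper that only builds the request URL
github_url <- function(path, query = NULL) {
  modify_url("https://api.github.com", path = path, query = query)
}

# Parameters are given as a named list rather than typed into the path
repos_url <- github_url("/users/vivait/repos",
                        query = list(sort = "updated", per_page = 100))
repos_url

# resp <- GET(repos_url)   # uncomment to send the request
```

Keeping the parameters in a list makes them easier to change from code, for example when looping over pages.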

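In fact, if we know the request will return JSON, we can pass the URL straight to jsonlite and let it download and parse the data in one step. (Not advised in a program, because you never get to inspect the response or its status code.)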

For example, the GitHub documentation says: “You can issue a GET request to the root endpoint to get all the endpoint categories that the REST API v3 supports.”

head(fromJSON("https://api.github.com"), 10)

$current_user_url
[1] "https://api.github.com/user"

$current_user_authorizations_html_url
[1] "https://github.com/settings/connections/applications{/client_id}"

$authorizations_url
[1] "https://api.github.com/authorizations"

$code_search_url
[1] "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}"

$commit_search_url
[1] "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}"

$emails_url
[1] "https://api.github.com/user/emails"

$emojis_url
[1] "https://api.github.com/emojis"

$events_url
[1] "https://api.github.com/events"

$feeds_url
[1] "https://api.github.com/feeds"

$followers_url
[1] "https://api.github.com/user/followers"

9.4 Twitter example

NOTE: the Twitter (X) API examples below no longer work (thanks Elon).

They will be replaced soon.

This code demonstrates how to use the rtweet package.

For more detail, see https://cran.r-project.org/web/packages/rtweet/vignettes/intro.html.

First you’ll need to set up a developer account with Twitter and get the access keys you need by creating a new app.

Follow the instructions at: https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html.

# library(rtweet)
# ## authenticate - insert your app name and keys below
# token <- create_token(
#   app = "R camlad",
#   consumer_key = api_key,
#   consumer_secret = api_secret_key,
#   access_token = access_token,
#   access_secret = access_token_secret)

Following a hashtag

We can search for tweets including a particular hashtag.

## search for tweets using the Cardano hashtag
# rt <- search_tweets("#Cardano", n = 100, include_rts = FALSE)
# 
# ## preview tweets data
# rt |> select(id, text)

Get a particular user’s timeline

library(stringr)
# tmls <- get_timeline("leicspolice", n = 100)
# 
# tmls |> 
#   select(created_at, text) |> 
#   filter(str_detect(text, 'Traffic'))

9.5 Accessing UK census (and other) data

Our final example demonstrates the Nomis API, which can be accessed through the nomisr (Odell 2018) package.

A quick demonstration of using nomisr to extract data from the Nomis API

This example is based on the nomisr introduction vignette.

library(nomisr)

First, we can download information on what data is available.

data_info <- nomis_data_info()
#head(data_info)
glimpse(data_info)

Rows: 1,605
Columns: 14
$ agencyid                             <chr> "NOMIS", "NOMIS", "NOMIS", "NOMIS…
$ id                                   <chr> "NM_1_1", "NM_2_1", "NM_4_1", "NM…
$ uri                                  <chr> "Nm-1d1", "Nm-2d1", "Nm-4d1", "Nm…
$ version                              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ annotations.annotation               <list> [<data.frame[10 x 2]>], [<data.f…
$ components.attribute                 <list> [<data.frame[7 x 4]>], [<data.fr…
$ components.dimension                 <list> [<data.frame[5 x 3]>], [<data.fr…
$ components.primarymeasure.conceptref <chr> "OBS_VALUE", "OBS_VALUE", "OBS_VA…
$ components.timedimension.codelist    <chr> "CL_1_1_TIME", "CL_2_1_TIME", "CL…
$ components.timedimension.conceptref  <chr> "TIME", "TIME", "TIME", "TIME", "…
$ description.value                    <chr> "Records the number of people cla…
$ description.lang                     <chr> "en", "en", "en", "en", "en", "en…
$ name.value                           <chr> "Jobseeker's Allowance with rates…
$ name.lang                            <chr> "en", "en", "en", "en", "en", "en…

There’s a lot here (data_info has 1,605 rows). To dig deeper we can search the columns description.value or name.value for keywords.

pop_data_info <- 
  data_info |> 
  filter(str_detect(name.value, "(?i)population")) |> 
  select(id, name.value)

#pop_data_info |> head()
glimpse(pop_data_info)

Rows: 110
Columns: 2
$ id         <chr> "NM_17_1", "NM_17_5", "NM_31_1", "NM_100_1", "NM_136_1", "N…
$ name.value <chr> "annual population survey", "annual population survey (vari…

Suppose we wanted population data for Leicester. It looks like “NM_31_1” might be worth investigating, so we can dig down deeper.

The data is categorised first by “concept”. (Read the docs at Nomis if you want more details.)

id <- "NM_31_1"
nomis_get_metadata(id)

# A tibble: 6 × 3
  codelist          conceptref isfrequencydimension
  <chr>             <chr>      <chr>               
1 CL_31_1_GEOGRAPHY GEOGRAPHY  false               
2 CL_31_1_SEX       SEX        false               
3 CL_31_1_AGE       AGE        false               
4 CL_31_1_MEASURES  MEASURES   false               
5 CL_31_1_FREQ      FREQ       true                
6 CL_31_1_TIME      TIME       false

GEOGRAPHY looks relevant, so we explore what “types” are available.

nomis_get_metadata(id, "GEOGRAPHY", type = "type")

# A tibble: 26 × 3
   id      label.en                                               description.en
   <chr>   <chr>                                                  <chr>         
 1 TYPE83  jobcentre plus group as of April 2019                  jobcentre plu…
 2 TYPE84  jobcentre plus district as of April 2019               jobcentre plu…
 3 TYPE342 english index of multiple deprivation 2010 - deciles   english index…
 4 TYPE347 scottish index of multiple deprivation 2009 - deciles  scottish inde…
 5 TYPE349 welsh index of multiple deprivation 2008 - deciles     welsh index o…
 6 TYPE431 local authorities: county / unitary (as of April 2021) local authori…
 7 TYPE432 local authorities: district / unitary (as of April 20… local authori…
 8 TYPE433 local authorities: county / unitary (as of April 2019) local authori…
 9 TYPE434 local authorities: district / unitary (as of April 20… local authori…
10 TYPE442 combined authorities                                   combined auth…
# ℹ 16 more rows

Finally, we can choose a particular type and investigate it.

id |> 
  nomis_get_metadata("GEOGRAPHY", type = "TYPE446") |> 
  filter(str_detect(label.en, "Leicester"))

# A tibble: 2 × 4
  id         parentCode label.en       description.en
  <chr>      <chr>      <chr>          <chr>         
1 1870659636 2013265924 Leicester      Leicester     
2 1870659640 2013265924 Leicestershire Leicestershire

Looks like we’ve found what we want!

leics_pop <- 
  nomis_get_data(id = id, time = "latest",
                 geography = c("1870659636", "1870659640"))

leics_pop |> 
  select(DATE, GEOGRAPHY_NAME, SEX_NAME, AGE_NAME, MEASURES_NAME, OBS_VALUE) |> 
  head(10)

# A tibble: 10 × 6
    DATE GEOGRAPHY_NAME SEX_NAME AGE_NAME           MEASURES_NAME OBS_VALUE
   <dbl> <chr>          <chr>    <chr>              <chr>             <dbl>
 1  2021 Leicester      Male     All ages           Value                NA
 2  2021 Leicester      Male     All ages           Percent              NA
 3  2021 Leicester      Male     Aged under 1 year  Value                NA
 4  2021 Leicester      Male     Aged under 1 year  Percent              NA
 5  2021 Leicester      Male     Aged 1 - 4 years   Value                NA
 6  2021 Leicester      Male     Aged 1 - 4 years   Percent              NA
 7  2021 Leicester      Male     Aged 5 - 9 years   Value                NA
 8  2021 Leicester      Male     Aged 5 - 9 years   Percent              NA
 9  2021 Leicester      Male     Aged 10 - 14 years Value                NA
10  2021 Leicester      Male     Aged 10 - 14 years Percent              NA
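The result mixes two measures (Value and Percent) and can contain missing observations, so some tidying is usually needed before summarising. The sketch below shows one way to do that with dplyr. It runs on a small toy table whose area names and numbers are entirely invented, standing in for a real nomis_get_data() result with the same column names.

```r
library(dplyr)

# Toy stand-in for a nomis_get_data() result; every value here is invented
toy_pop <- tibble(
  GEOGRAPHY_NAME = c("Area A", "Area A", "Area A", "Area B", "Area B"),
  SEX_NAME       = c("Male", "Male", "Female", "Male", "Female"),
  MEASURES_NAME  = c("Value", "Percent", "Value", "Value", "Value"),
  OBS_VALUE      = c(100, 50, NA, 200, 300)
)

# Keep the counts only, drop missing observations, then total by area
pop_counts <- toy_pop |>
  filter(MEASURES_NAME == "Value", !is.na(OBS_VALUE)) |>
  group_by(GEOGRAPHY_NAME) |>
  summarise(total = sum(OBS_VALUE), .groups = "drop")

pop_counts
```

The same pipeline should apply to leics_pop once it contains non-missing values in OBS_VALUE.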

9.6 Homework

Install the package randNames and, using the instructions in the package documentation, register for a free API key at randomapi.com.

Write a program to download random data for 400 imaginary users. What is the distribution of genders and countries of origin in this data?

Optional Christmas Bonus question

Register an account at Advent of Code. For the 2020 competition solve Question 2. (The key to solving this elegantly is reading the data in and wrangling it into the best format to solve the problem.)