7  Some other data structures

These are the course notes for the 2023 version of Fundamentals of Data Science
(MA7419 / MA3419)

7.1 Overview

This week we’ll be covering different data structures:

  • XML
  • JSON
  • unstructured (or semi-structured) text

7.2 XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML can serve as the basis for defining markup languages for particular domains. Examples include XBRL (Extensible Business Reporting Language), KML (Keyhole Markup Language, for geographic information) and BeerXML (you guessed it).

Here is a simple example containing some data about my pets.

<pets>
  <pet id = '001' species = 'dog'>
    <tag>Rover</tag>
    <colour>black</colour>
  </pet>
  <pet id = '002' species = 'cat'>
    <tag>Tiddles</tag>
    <colour>ginger</colour>
  </pet>
  <pet id = '003' species = 'dog'>
    <tag>Fido</tag>
    <colour>brownish</colour>
  </pet>
</pets>

Even if you knew nothing about XML before, you can work out what is going on.

Package xml2

The package xml2 gives you tools to read an XML file and extract the data.

https://blog.rstudio.com/2015/04/21/xml2/

Normally you would read the data from an external file, but for a very small example I’ve saved the data above as a string called my_pets.
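The step that turns the string into a parsed document is not shown in these notes. A minimal sketch, assuming that the object xpets used in the examples that follow is simply the result of calling read_xml() on my_pets (read_xml() accepts a file path, a URL or a string of literal XML):

```r
library(xml2)

# The pets data from above, stored as a single string
my_pets <- "
<pets>
  <pet id = '001' species = 'dog'>
    <tag>Rover</tag>
    <colour>black</colour>
  </pet>
  <pet id = '002' species = 'cat'>
    <tag>Tiddles</tag>
    <colour>ginger</colour>
  </pet>
  <pet id = '003' species = 'dog'>
    <tag>Fido</tag>
    <colour>brownish</colour>
  </pet>
</pets>"

# Assumption: xpets is the parsed version of my_pets
xpets <- read_xml(my_pets)

xml_name(xpets)      # name of the root element
xml_length(xpets)    # number of child elements (one per pet)
xml_children(xpets)  # the <pet> nodes themselves
```

For a real data set you would pass a file name or URL to read_xml() instead of a string.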

Searching

We can navigate and search through the tree with more precision using XPath.

xml_find_first(xpets, '//pet[@species="cat"]') # Find the first cat
{xml_node}
<pet id="002" species="cat">
[1] <tag>Tiddles</tag>
[2] <colour>ginger</colour>

xpets |>
  xml_find_all("//pet[@species='dog']") |> # List the dogs' names
  xml_find_all(".//tag") |>                # Note the important .
  xml_text()
[1] "Rover" "Fido"

xpets |>
  xml_find_all(".//pet[@species='dog']") |>
  xml_attr("id")
[1] "001" "003"
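To see why the leading . matters, compare a relative path with an absolute one. In XPath, //tag searches the whole document regardless of the current node, whereas .//tag searches only beneath the current node. The sketch below uses a cut-down copy of the pets data so that it runs on its own; the object names dogs, relative and absolute are just for illustration.

```r
library(xml2)

# A cut-down copy of the pets data (names only)
xpets <- read_xml("
<pets>
  <pet id='001' species='dog'><tag>Rover</tag></pet>
  <pet id='002' species='cat'><tag>Tiddles</tag></pet>
  <pet id='003' species='dog'><tag>Fido</tag></pet>
</pets>")

dogs <- xml_find_all(xpets, "//pet[@species='dog']")

# Relative path: searches beneath each dog node only
relative <- xml_text(xml_find_all(dogs, ".//tag"))

# Absolute path: searches the whole document, so the cat's tag comes back too
absolute <- xml_text(xml_find_all(dogs, "//tag"))

relative
absolute
```

Without the dot, the filter on species is silently lost, which is an easy mistake to miss in a larger document.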

Create a data frame

By extracting each quantity we want separately we can put together a data frame.

# Attributes stored on each <pet> element
pet_id <-
  xpets |>
  xml_find_all(".//pet") |>
  xml_attr("id")
pet_species <-
  xpets |>
  xml_find_all(".//pet") |>
  xml_attr("species")
# Text stored in the child elements of each <pet>
pet_name <-
  xpets |>
  xml_find_all(".//pet/tag") |>
  xml_text()
pet_colour <-
  xpets |>
  xml_find_all(".//pet/colour") |>
  xml_text()
# Combine the four vectors, one row per pet
dfpets <-
  data.frame(ID = pet_id,
             Name = pet_name,
             Species = pet_species,
             Colour = pet_colour)
dfpets
   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
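Extracting each column separately works here because every pet has exactly one tag and one colour. If an element were missing, the vectors would have different lengths and data.frame() would fail or misalign the rows. One possible alternative, sketched below on a made-up variant of the data in which the cat has no colour, is to find the pet nodes once and then use xml_find_first(), which returns one result per input node and gives NA where nothing matches.

```r
library(xml2)

# Hypothetical variant of the pets data: the second pet has no <colour>
xpets2 <- read_xml("
<pets>
  <pet id='001' species='dog'><tag>Rover</tag><colour>black</colour></pet>
  <pet id='002' species='cat'><tag>Tiddles</tag></pet>
</pets>")

# Find the <pet> nodes once, then look inside each one
pets <- xml_find_all(xpets2, ".//pet")

dfpets2 <- data.frame(
  ID      = xml_attr(pets, "id"),
  Name    = xml_text(xml_find_first(pets, "./tag")),
  Species = xml_attr(pets, "species"),
  Colour  = xml_text(xml_find_first(pets, "./colour"))  # NA where missing
)
dfpets2
```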

7.3 JSON

JSON is a syntax for storing and exchanging data. It’s lightweight, human-readable, language-independent and very widely used. Most programming languages can process JSON.

To see what JSON looks like we’ll use the jsonlite package to convert our pets data from a data frame to JSON.

library(jsonlite)
jpets <- toJSON(dfpets)
jpets
[{"ID":"001","Name":"Rover","Species":"dog","Colour":"black"},{"ID":"002","Name":"Tiddles","Species":"cat","Colour":"ginger"},{"ID":"003","Name":"Fido","Species":"dog","Colour":"brownish"}] 

JSON stands for JavaScript Object Notation. The text inside each set of curly brackets represents a JavaScript object, but it is just text, so JSON can be used independently of JavaScript. You can think of each object as being like a card in an old-fashioned card index system.

[Figure: a man working on the card index]

We can make it easier to read:

prettify(jpets)
[
    {
        "ID": "001",
        "Name": "Rover",
        "Species": "dog",
        "Colour": "black"
    },
    {
        "ID": "002",
        "Name": "Tiddles",
        "Species": "cat",
        "Colour": "ginger"
    },
    {
        "ID": "003",
        "Name": "Fido",
        "Species": "dog",
        "Colour": "brownish"
    }
]
 

jsonlite has a function to convert JSON back to a data frame.

fromJSON(jpets)
   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
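JSON values are not all strings: numbers and the literals true and false are written without quotes. A hand-written sketch, using the made-up fields Age and Vaccinated, suggests how fromJSON() carries those types over to the columns of the data frame.

```r
library(jsonlite)

# Hypothetical JSON text; Age and Vaccinated are invented for illustration
pets_json <- '[
  {"Name": "Rover",   "Age": 7, "Vaccinated": true},
  {"Name": "Tiddles", "Age": 3, "Vaccinated": false}
]'

pets_from_json <- fromJSON(pets_json)

# An array of similar objects is simplified to a data frame,
# with one row per object and one column per key
str(pets_from_json)
```

Note that in the pets example above the ID values are quoted in the JSON, so they come back as character strings rather than numbers.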

7.4 Text processing

Text processing (a branch of Natural Language Processing, or NLP) is a big topic and we can only scratch the surface in this module.

In Wednesday’s class we will look at an example of sentiment analysis.

We’ll use a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labeled by Finn Årup Nielsen in 2009-2011 (Nielsen 2011), but I have removed profanities from the list because I used it for a presentation in a school.
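To give a flavour of how a word list like this can be used, here is a minimal sketch in base R. The lexicon below is a tiny made-up stand-in, not the actual list or its scores, and score_sentiment() is a hypothetical helper rather than the method we will use in class. It splits the text into words with a regular expression, looks each word up and adds the scores.

```r
# Hypothetical mini-lexicon: the words and scores are invented for illustration
lexicon <- c(good = 3, happy = 3, love = 3, bad = -3, terrible = -3)

# Hypothetical helper: total valence of the words in a piece of text
score_sentiment <- function(text, lexicon) {
  # Split on anything that is not a letter or apostrophe, then lower-case
  words <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))
  words <- words[words != ""]
  # Words not in the lexicon give NA and are ignored
  sum(lexicon[words], na.rm = TRUE)
}

score_sentiment("What a good, happy day!", lexicon)
score_sentiment("A terrible, bad film.", lexicon)
score_sentiment("The cat sat on the mat.", lexicon)
```

A simple sum ignores negation ("not good") and context, which is one reason this only scratches the surface.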

We’ll also use some classic texts downloaded from Project Gutenberg.

Please see the Further Reading for more information on text processing (and you’ll probably want to refresh your regex skills).

7.5 Reading

7.6 Further reading

Text Mining with R: A Tidy Approach (Silge and Robinson 2017) is an excellent introduction, compatible with the methods used in this course.

Check your understanding

library(dplyr) # provides the starwars data set
jstarwars <- toJSON(starwars, pretty = TRUE)
dfstarwars <- fromJSON(jstarwars)

identical(starwars, dfstarwars)
[1] FALSE

identical() returns FALSE. Can you work out why?