7 Some other data structures

These are the course notes for the 2023 version of Fundamentals of Data Science
(MA7419 / MA3419)

7.1 Overview

This week we’ll be covering different data structures:

XML
JSON
unstructured (or semi-structured) text

7.2 XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML can serve as the basis for defining markup languages for particular domains. Examples include XBRL (Extensible Business Reporting Language), KML (Keyhole Markup Language, for geographic information) and BeerXML (you guessed it).

Here is a simple example containing some data about my pets.

<pets>
  <pet id = '001' species = 'dog'>
    <tag>Rover</tag>
    <colour>black</colour>
  </pet>
  <pet id = '002' species = 'cat'>
    <tag>Tiddles</tag>
    <colour>ginger</colour>
  </pet>
  <pet id = '003' species = 'dog'>
    <tag>Fido</tag>
    <colour>brownish</colour>
  </pet>
</pets>

Even if you knew nothing about XML before, you can work out what is going on.

Package `xml2`

The package xml2 gives you tools to read an XML file and extract the data.

https://blog.rstudio.com/2015/04/21/xml2/

Normally you would read the data from an external file, but for a very small example I’ve saved the data above as a string called my_pets.
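
The definition of my_pets is not shown in these notes. A minimal sketch of how it might be set up, assuming the string simply holds the XML above (the exact code used to create it is an assumption, as is the name pets_doc):

```r
library(xml2)

# Assumed definition: the pets XML from above, stored as one string.
# Double quotes wrap the R string because the XML uses single quotes.
my_pets <- "<pets>
  <pet id = '001' species = 'dog'>
    <tag>Rover</tag>
    <colour>black</colour>
  </pet>
  <pet id = '002' species = 'cat'>
    <tag>Tiddles</tag>
    <colour>ginger</colour>
  </pet>
  <pet id = '003' species = 'dog'>
    <tag>Fido</tag>
    <colour>brownish</colour>
  </pet>
</pets>"

# read_xml() accepts literal XML as well as a file path or URL
pets_doc <- read_xml(my_pets)
xml_length(pets_doc)  # number of child nodes of the root
```

For data held in a file you would pass the file path to read_xml() instead of the string.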

Navigating the tree

In XML everything is arranged in a tree structure, and each element is called a node.

You are familiar with a tree structure - just think of the way the files are organised on your computer: folders sit inside folders and each folder can contain other folders and files.

In a similar way, nodes sit inside nodes and each node can contain other nodes. The first node, that contains everything else, is the root node, and a node that doesn’t contain any other nodes is called a leaf node.

In our example <pets> … </pets> is the root node, and the text strings “Fido”, “brownish” etc. are leaf nodes.

Nodes can also contain attributes. For example the <pet> … </pet> nodes have “id” and “species” attributes.

Every node, except the root, has exactly one parent, and nodes can have children and siblings (nodes with the same parent).

We can use these concepts to navigate the tree.

library(xml2)
xpets <- read_xml(my_pets)
xml_name(xpets)  # The name of the root node

[1] "pets"

xml_child(xpets) # Finds the first child of the root (Rover)

{xml_node}
<pet id="001" species="dog">
[1] <tag>Rover</tag>
[2] <colour>black</colour>

xml_children(xpets) # Finds all the children of the root (Rover, Tiddles & Fido)

{xml_nodeset (3)}
[1] <pet id="001" species="dog">\n  <tag>Rover</tag>\n  <colour>black</colour ...
[2] <pet id="002" species="cat">\n  <tag>Tiddles</tag>\n  <colour>ginger</col ...
[3] <pet id="003" species="dog">\n  <tag>Fido</tag>\n  <colour>brownish</colo ...

xml_children(xpets) |> xml_name() # The name of each child

[1] "pet" "pet" "pet"

xml_child(xpets) |> xml_siblings() # the siblings of Rover

{xml_nodeset (2)}
[1] <pet id="002" species="cat">\n  <tag>Tiddles</tag>\n  <colour>ginger</col ...
[2] <pet id="003" species="dog">\n  <tag>Fido</tag>\n  <colour>brownish</colo ...

xml_child(xpets) |> xml_parent() # The parent of Rover

{xml_node}
<pets>
[1] <pet id="001" species="dog">\n  <tag>Rover</tag>\n  <colour>black</colour ...
[2] <pet id="002" species="cat">\n  <tag>Tiddles</tag>\n  <colour>ginger</col ...
[3] <pet id="003" species="dog">\n  <tag>Fido</tag>\n  <colour>brownish</colo ...

xml_child(xpets) |> xml_child() # The first child of Rover is <tag> (<colour> is its sibling)

{xml_node}
<tag>
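
Once you have navigated to a node, xml2 provides xml_attrs(), xml_attr() and xml_text() to pull out its attributes and text. A short sketch, using a cut-down copy of the pets data so that it runs on its own (the names small_pets and first_pet are just for illustration):

```r
library(xml2)

small_pets <- read_xml(
  "<pets>
     <pet id='001' species='dog'><tag>Rover</tag><colour>black</colour></pet>
     <pet id='002' species='cat'><tag>Tiddles</tag><colour>ginger</colour></pet>
   </pets>"
)

first_pet <- xml_child(small_pets)            # the first <pet> node

xml_attrs(first_pet)                          # all attributes, as a named character vector
xml_attr(first_pet, "species")                # a single attribute, by name
xml_child(first_pet, "colour") |> xml_text()  # the text inside the <colour> child
xml_child(small_pets, 2) |> xml_attr("id")    # the id of the second child of the root
```

Note that xml_child() accepts either a position or a child name as its second argument.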

Searching

We can navigate and search through the tree with more precision using XPath.

xml_find_first(xpets, '//pet[@species="cat"]') # Find the first cat

{xml_node}
<pet id="002" species="cat">
[1] <tag>Tiddles</tag>
[2] <colour>ginger</colour>

xpets |>
  xml_find_all("//pet[@species='dog']") |> # List the dogs' names
  xml_find_all(".//tag") |>                # Note the important .
  xml_text()

[1] "Rover" "Fido"
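
The leading . matters because an XPath that starts with // searches from the root of the whole document, whichever node you start from, whereas .// searches only below the current node. A sketch of the difference, using a cut-down copy of the pets data (the names doc, dogs, relative and absolute are just for illustration):

```r
library(xml2)

doc <- read_xml(
  "<pets>
     <pet species='dog'><tag>Rover</tag></pet>
     <pet species='cat'><tag>Tiddles</tag></pet>
     <pet species='dog'><tag>Fido</tag></pet>
   </pets>"
)

dogs <- xml_find_all(doc, "//pet[@species='dog']")

# With the dot: search only inside each dog node
relative <- dogs |> xml_find_all(".//tag") |> xml_text()

# Without the dot: search the whole document again, so the cat's tag is found too
absolute <- dogs |> xml_find_all("//tag") |> xml_text()

relative
absolute
```

Here relative should hold only the two dogs' names, while absolute also picks up Tiddles, which is probably not what was intended.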

xpets |>
  xml_find_all(".//pet[@species='dog']") |>
  xml_attr("id")

[1] "001" "003"

Create a data frame

By extracting each quantity we want separately we can put together a data frame.

pet_id <-
  xpets |>
  xml_find_all(".//pet") |>
  xml_attr("id")

pet_species <-
  xpets |>
  xml_find_all(".//pet") |>
  xml_attr("species")

pet_name <- 
  xpets |> 
  xml_find_all(".//pet/tag") |> 
  xml_text()

pet_colour <- 
  xpets |> 
  xml_find_all(".//pet/colour") |> 
  xml_text()

dfpets <- 
  data.frame(ID = pet_id,
             Name = pet_name,
             Species = pet_species,
             Colour = pet_colour)
dfpets

   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
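
Extracting each column separately works here because every pet has exactly one tag and one colour. If a pet were missing an element, the vectors would have different lengths and data.frame() would fail. One possible alternative, sketched below on a made-up example in which the second pet has no colour, is to find the pet nodes once and then use xml_find_first(), which returns one result per input node (the names patchy, pet_nodes and dfpatchy are just for illustration):

```r
library(xml2)

# Made-up example: the second pet has no <colour> element
patchy <- read_xml(
  "<pets>
     <pet id='001' species='dog'><tag>Rover</tag><colour>black</colour></pet>
     <pet id='002' species='cat'><tag>Tiddles</tag></pet>
   </pets>"
)

# Find the pet nodes once, then look inside each of them
pet_nodes <- xml_find_all(patchy, ".//pet")

dfpatchy <-
  data.frame(ID      = xml_attr(pet_nodes, "id"),
             Name    = pet_nodes |> xml_find_first("./tag") |> xml_text(),
             Species = xml_attr(pet_nodes, "species"),
             Colour  = pet_nodes |> xml_find_first("./colour") |> xml_text())
dfpatchy
```

The missing colour should appear as NA rather than shifting the remaining values up a row.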

7.3 JSON

JSON is a syntax for storing and exchanging data. It’s lightweight, human readable, language-independent and very widely used. Most programming languages can process JSON.

To see what JSON looks like we’ll use the jsonlite package to convert our pets data from a data frame to JSON.

library(jsonlite)
jpets <- toJSON(dfpets)
jpets

[{"ID":"001","Name":"Rover","Species":"dog","Colour":"black"},{"ID":"002","Name":"Tiddles","Species":"cat","Colour":"ginger"},{"ID":"003","Name":"Fido","Species":"dog","Colour":"brownish"}]

JSON stands for JavaScript Object Notation. The text inside each set of curly brackets represents a JavaScript object, but it is just text, and JSON is independent of JavaScript. You can think of each object as being like a card in an old-fashioned card index system.

We can make it easier to read:

prettify(jpets)

[
    {
        "ID": "001",
        "Name": "Rover",
        "Species": "dog",
        "Colour": "black"
    },
    {
        "ID": "002",
        "Name": "Tiddles",
        "Species": "cat",
        "Colour": "ginger"
    },
    {
        "ID": "003",
        "Name": "Fido",
        "Species": "dog",
        "Colour": "brownish"
    }
]

jsonlite has a function to convert JSON back to a data frame.

fromJSON(jpets)

   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
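
Because JSON is just text, it can be typed by hand or received from any other program, and fromJSON() will parse it. A small sketch with a made-up fourth pet (the data and the names new_pet and dfnew are invented purely for illustration):

```r
library(jsonlite)

# JSON written by hand; single quotes wrap the R string
# because JSON itself uses double quotes
new_pet <- '[{"ID":"004","Name":"Polly","Species":"parrot","Colour":"green"}]'

validate(new_pet)           # checks that the text is valid JSON
dfnew <- fromJSON(new_pet)  # an array of objects becomes a data frame
str(dfnew)
```

The ID stays a character string because it is quoted in the JSON text.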

7.4 Text processing

Text processing (a branch of Natural Language Processing, or NLP) is a big topic and we can only scratch the surface in this module.

In Wednesday’s class we will look at an example of sentiment analysis.

We’ll use a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labelled by Finn Årup Nielsen in 2009-2011 (Nielsen 2011). I have removed profanities from the list because I used it for a presentation in a school.

We’ll also use some classic texts downloaded from Project Gutenberg.
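
The idea behind scoring text with a word list can be sketched in a few lines of base R. The toy lexicon below is invented for illustration and is not taken from Nielsen's list, and summing the word scores is just one simple approach, not necessarily the one we will use in class:

```r
# Toy lexicon: hypothetical words and scores, NOT taken from Nielsen's list
lexicon <- c(good = 3, happy = 3, bad = -3, terrible = -3)

sentence <- "The food was good but the service was terrible, really terrible"

# Convert to lower case, then split on anything that is not a letter
words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]

# Look up each word by name; words not in the lexicon give NA
scores <- lexicon[words]

# Add up the scores of the words that were found
total <- sum(scores, na.rm = TRUE)
total  # 3 - 3 - 3 = -3
```

A negative total suggests the sentence is negative overall, although this simple method ignores context such as negation ("not good").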

Please see the Further Reading for more information on text processing (and you’ll probably want to refresh your regex skills).

7.5 Reading

XML
- What is XML (Just the first bit)
- Parse and process XML (and HTML) with xml2
JSON
- Intro to JSON

7.6 Further reading

Text Mining with R: A Tidy Approach (Silge and Robinson 2017) is an excellent introduction, compatible with the methods used in this course.

Check your understanding

Convert the starwars data (from the dplyr package) to JSON and examine the structure.

library(dplyr)  # provides the starwars data set
jstarwars <- toJSON(starwars, pretty = TRUE)

Try converting back to a data frame and compare with the original.

dfstarwars <- fromJSON(jstarwars)

identical(starwars, dfstarwars)

[1] FALSE

identical() returns FALSE. Can you work out why?
