These are the course notes for the 2023 version of Fundamentals of Data Science (MA7419 / MA3419)
7.1 Overview
This week we’ll be covering different data structures:
XML
JSON
unstructured (or semi-structured) text
7.2 XML
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
XML can serve as the basis for defining markup languages for particular domains. Examples include XBRL (Extensible Business Reporting Language), KML (Keyhole Markup Language, for geographic information) and BeerXML (you guessed it).
Here is a simple example containing some data about my pets.
<pets>
<pet id = '001' species = 'dog'>
<tag>Rover</tag>
<colour>black</colour>
</pet>
<pet id = '002' species = 'cat'>
<tag>Tiddles</tag>
<colour>ginger</colour>
</pet>
<pet id = '003' species = 'dog'>
<tag>Fido</tag>
<colour>brownish</colour>
</pet>
</pets>
Even if you knew nothing about XML before, you can work out what is going on.
Package xml2
The package xml2 gives you tools to read an XML file and extract the data.
https://blog.rstudio.com/2015/04/21/xml2/
Normally you would read the data from an external file, but for a very small example I’ve saved the data above as a string called my_pets.
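As a minimal sketch of that step, the block below stores the XML from above in a string and parses it. The name my_pets comes from these notes; the indentation inside the string is just my own layout choice.

```r
library(xml2)

# The pets document from above, stored as one character string.
# The string starts directly with the root tag.
my_pets <- "<pets>
  <pet id = '001' species = 'dog'>
    <tag>Rover</tag>
    <colour>black</colour>
  </pet>
  <pet id = '002' species = 'cat'>
    <tag>Tiddles</tag>
    <colour>ginger</colour>
  </pet>
  <pet id = '003' species = 'dog'>
    <tag>Fido</tag>
    <colour>brownish</colour>
  </pet>
</pets>"

# read_xml() accepts literal XML as well as a file path or URL
xpets <- read_xml(my_pets)
```

For a real project you would pass a file path or URL to read_xml() instead of a string.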
Navigating the tree
In XML everything is arranged in a tree structure, and each element is called a node.
You are familiar with a tree structure - just think of the way the files are organised on your computer: folders sit inside folders and each folder can contain other folders and files.
In a similar way, nodes sit inside nodes and each node can contain other nodes. The first node, that contains everything else, is the root node, and a node that doesn’t contain any other nodes is called a leaf node.
In our example … is the root node, and the text strings “Fido”, “brownish” etc. are leaf nodes.
Nodes can also contain attributes. For example the … nodes have “id” and “species” attributes.
Every node, except the root, has exactly one parent, and nodes can have children and siblings (nodes with the same parent).
We can use these concepts to navigate the tree.
xpets <- read_xml(my_pets)
xml_name(xpets) # The name of the root node
[1] "pets"
xml_child(xpets) # Finds the first child of the root (Rover)
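The other family relationships (parent, children, siblings) and the attributes have matching helper functions in xml2. The sketch below rebuilds a compact copy of the pets document so that it runs on its own; the variable name first_pet is mine, chosen for illustration.

```r
library(xml2)

# Compact copy of the pets document so that this block is self-contained
xpets <- read_xml(paste0(
  "<pets>",
  "<pet id='001' species='dog'><tag>Rover</tag><colour>black</colour></pet>",
  "<pet id='002' species='cat'><tag>Tiddles</tag><colour>ginger</colour></pet>",
  "<pet id='003' species='dog'><tag>Fido</tag><colour>brownish</colour></pet>",
  "</pets>"
))

first_pet <- xml_child(xpets)           # first child of the root

xml_length(xpets)                       # number of children of the root: 3
xml_attr(first_pet, "species")          # one attribute: "dog"
xml_attrs(first_pet)                    # all attributes as a named vector
xml_text(xml_child(first_pet, "tag"))   # text of the <tag> child: "Rover"
xml_name(xml_parent(first_pet))         # back up to the parent: "pets"
length(xml_siblings(first_pet))         # the other two pets: 2
```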
Collecting the attributes and child text of each pet node into a data frame gives:
   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
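One way a table like this could be built from the XML is sketched below. This is my own illustration rather than the only approach: it uses XPath expressions with xml_find_all() and xml_find_first(), and the names pet_nodes and pets are assumptions.

```r
library(xml2)

# Compact copy of the pets document so that this block is self-contained
xpets <- read_xml(paste0(
  "<pets>",
  "<pet id='001' species='dog'><tag>Rover</tag><colour>black</colour></pet>",
  "<pet id='002' species='cat'><tag>Tiddles</tag><colour>ginger</colour></pet>",
  "<pet id='003' species='dog'><tag>Fido</tag><colour>brownish</colour></pet>",
  "</pets>"
))

# All <pet> nodes, wherever they sit in the tree
pet_nodes <- xml_find_all(xpets, "//pet")

# xml_attr() and xml_text() are vectorised over a node set,
# so each call returns one value per pet
pets <- data.frame(
  ID      = xml_attr(pet_nodes, "id"),
  Name    = xml_text(xml_find_first(pet_nodes, "./tag")),
  Species = xml_attr(pet_nodes, "species"),
  Colour  = xml_text(xml_find_first(pet_nodes, "./colour")),
  stringsAsFactors = FALSE
)

pets
```

Note that the ID column is kept as text, which preserves the leading zeros.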
7.3 JSON
JSON is a syntax for storing and exchanging data. It’s lightweight, human readable, language-independent and very widely used. Most programming languages can process JSON.
To see what JSON looks like we’ll use the jsonlite package to convert our pets data from a data frame to JSON.
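A minimal sketch of that conversion is below. The data frame is typed in by hand so that the block runs on its own, and its name, pets, is an assumption; jpets is the name used later in these notes.

```r
library(jsonlite)

# The pets data as a data frame (typed in by hand for this sketch)
pets <- data.frame(
  ID      = c("001", "002", "003"),
  Name    = c("Rover", "Tiddles", "Fido"),
  Species = c("dog", "cat", "dog"),
  Colour  = c("black", "ginger", "brownish"),
  stringsAsFactors = FALSE
)

# By default toJSON() writes a data frame as an array with one object per row;
# pretty = TRUE adds line breaks and indentation to make it easier to read
jpets <- toJSON(pets, pretty = TRUE)
jpets
```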
JSON stands for JavaScript Object Notation. The text inside each set of curly brackets represents a JavaScript object, but it is just text, so JSON is independent of JavaScript. You can think of each object as being like a card in an old-fashioned card index system.
jsonlite has a function to convert JSON back to a data frame.
fromJSON(jpets)
   ID    Name Species   Colour
1 001   Rover     dog    black
2 002 Tiddles     cat   ginger
3 003    Fido     dog brownish
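To make the conversion easier to follow, the sketch below writes the JSON out by hand as a string and then parses it. The layout of the JSON text is my own; jpets is the name used above.

```r
library(jsonlite)

# Hand-written JSON: an array of three objects, one per pet
jpets <- '[
  {"ID": "001", "Name": "Rover",   "Species": "dog", "Colour": "black"},
  {"ID": "002", "Name": "Tiddles", "Species": "cat", "Colour": "ginger"},
  {"ID": "003", "Name": "Fido",    "Species": "dog", "Colour": "brownish"}
]'

# An array of objects with the same keys is simplified to a data frame
pets <- fromJSON(jpets)

# Each column is character, because every value is quoted in the JSON
str(pets)
```

Because the IDs are quoted strings in the JSON, they come back as text and keep their leading zeros.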
7.4 Text processing
Text processing (a branch of Natural Language Processing, or NLP) is a big topic and we can only scratch the surface in this module.
In Wednesday’s class we will look at an example of sentiment analysis.
We’ll use a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labelled by Finn Årup Nielsen in 2009-2011 (Nielsen 2011), but I have removed profanities from the list because I used it for a presentation in a school.
We’ll also use some classic texts downloaded from Project Gutenberg.
Please see the Further Reading for more information on text processing (and you’ll probably want to refresh your regex skills).
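To give a flavour of the idea before the class, here is a minimal sketch of word-list sentiment scoring in base R. The tiny lexicon is made up for illustration (the words and scores are not taken from the AFINN list), and the function name score_text is my own.

```r
# Hypothetical mini-lexicon: made-up scores, NOT the real AFINN values
valence <- c(good = 3, happy = 3, love = 3, bad = -3, sad = -2, terrible = -3)

score_text <- function(text, lexicon) {
  # Lower-case the text, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]   # drop empty strings left by the split

  # Look each word up by name; words not in the lexicon give NA
  scores <- lexicon[words]

  # Total valence of the text, ignoring unrated words
  sum(scores, na.rm = TRUE)
}

score_text("What a good, happy day!", valence)     # 6
score_text("A sad and terrible ending.", valence)  # -5
```

Summing the scores is the simplest choice; dividing by the number of words would make long and short texts easier to compare.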
Nielsen, F. Å. 2011. “AFINN.” Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby: Informatics; Mathematical Modelling, Technical University of Denmark. http://www2.compute.dtu.dk/pubdb/pubs/6010-full.html.