In a previous post, we discussed the creation of a Linked Data ontology that can be used to describe existing fan-created data that the JVMG is working with. For the ontology to work correctly, the data itself must also be converted into a Linked Data format, and so in this post we’ll be discussing the transformation of data, as it’s received from providers, into RDF.
To summarize, our workflow uses Python and the RDFLib library inside a set of Jupyter notebooks to transform and export the data from all of the data provider partners. Data ingestion is also sometimes done using Python and Jupyter notebooks, but here we’ll just focus on the data transformation.
Though parts of the process can change rather drastically between data providers, and even between separate sets of data from the same provider, we’ll illustrate the general process using an example from a provider that makes data available as SQL tables.
Above is a Jupyter Notebook screenshot showing the first two cells of the data transformation; the first gathers the headers, which in this case are stored in a separate file, and the second applies the headers to the data itself. The result is a TSV with proper headers that we can then begin to transform into RDF.
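For readers without the screenshot, the header step can be sketched roughly as follows. This is a minimal illustration using only Python's standard library rather than the project's actual notebook code; the column names, the in-memory stand-ins for the two files, and the `apply_headers` helper are all assumptions made for the example.

```python
import csv
import io

# Hypothetical stand-ins for the provider's files: the column names live in
# one file, and the header-less, tab-separated rows live in another.
header_file = io.StringIO("id\ttitle\talias\twikipedia\n")
data_file = io.StringIO("10\tNarcissu\tナルキッソス\tNarcissu\n")


def apply_headers(header_fh, data_fh):
    """Return the data rows as dicts keyed by the separately stored headers."""
    # First cell in the post: gather the headers from their own file.
    headers = next(csv.reader(header_fh, delimiter="\t"))
    # Second cell in the post: apply those headers to the raw data rows.
    reader = csv.DictReader(data_fh, fieldnames=headers, delimiter="\t")
    return list(reader)


rows = apply_headers(header_file, data_file)
```

Each row is now addressable by column name (for example `rows[0]["title"]`), which is what the later RDF step relies on.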
Above is an edited and much simplified version of the actual data transformation cell that utilizes the RDFLib library. The cell iterates over the entire data table and, with the g.add lines, creates RDF triples using various parameters. In essence, each triple uses the row ID (or some other unique identifier) as the subject, the column header as the predicate, and the row value as the object. Depending on the data, several manipulations take place at this stage, such as the transformation of certain string values into integers to better work with the ontology labels, and the appending of URL prefixes so that things such as Wikipedia article names may be turned into full URLs (if interested, here is an example of a more thorough cell). The last step here is to output the data as a TTL file for us to then use with our Linked Data frontend. Thus, the process transforms tabular data…
…into RDF data:
As was introduced in the previous blog post, RDF is a set of statements about resources, each expressed as a triple in subject-predicate-object form, and this is the form the data has now been transformed into. In the above screenshot, the subject can be seen on the unindented lines, e.g. <http://mediagraph.link/vndb/vns/10> in line 18. Following this is a block that contains the predicate and object pairs describing that subject, so we can see that the resource represented by <http://mediagraph.link/vndb/vns/10> is described as having the title “Narcissu”, an alias “ナルキッソス”, and a Wikipedia link of “https://en.wikipedia.org/wiki/Narcissu”.
Transforming the data into RDF allows us to load it into our web frontend, making the data easily browsable. Once data from multiple providers has been identified as describing the same resource, the frontend is also a convenient way of viewing this matched, aggregate data on a single page. Importantly, the transformation into RDF also allows advanced queries to be run on the data using SPARQL.
Hopefully these posts have provided a bit of insight into how we’re working with the data. Some upcoming blog posts will explore different use cases, and should be helpful in providing insight as to the types of research the data transformation can enable.