Turning Fan-Created Data into Linked Data II: Data Transformation

In a previous post, we discussed the creation of a Linked Data ontology that can be used to describe existing fan-created data that the JVMG is working with. For the ontology to work correctly, the data itself must also be converted into a Linked Data format, and so in this post we’ll be discussing the transformation of data, as it’s received from providers, into RDF.

In brief, our workflow uses Python and the RDFLib library inside a set of Jupyter notebooks to transform and export the data from all of our data provider partners. Data ingestion is also sometimes done with Python and Jupyter notebooks, but here we’ll focus only on the data transformation.

Though parts of the process can change rather drastically between data providers, and even between separate sets of data from the same provider, we’ll illustrate the general process using an example from a provider that makes data available as SQL tables.

Jupyter Notebook screenshot.

Above is a Jupyter Notebook screenshot showing the first two cells of the data transformation; the first gathers the headers, which in this case are stored in a separate file, and the second applies the headers to the data itself. The result is a TSV with proper headers that we can then begin to transform into RDF.
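In code, those two cells amount to something like the following minimal sketch, assuming the headers live in a plain-text file with one column name per line; the file names here are hypothetical stand-ins for the provider’s actual dump files:

```python
import pandas as pd

# Hypothetical file names standing in for the provider's actual dump files.
HEADERS_FILE = "vn_headers.txt"        # one column name per line
DATA_FILE = "vn_data.tsv"              # raw table dump without a header row
OUTPUT_FILE = "vn_with_headers.tsv"    # TSV with proper headers

# Cell 1: gather the column headers from the separate file.
with open(HEADERS_FILE, encoding="utf-8") as f:
    headers = [line.strip() for line in f if line.strip()]

# Cell 2: apply the headers to the headerless data and write a proper TSV.
df = pd.read_csv(DATA_FILE, sep="\t", header=None, names=headers, dtype=str)
df.to_csv(OUTPUT_FILE, sep="\t", index=False)
```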

Jupyter Notebook screenshot.

Above is an edited and much simplified version of the actual data transformation cell that uses the RDFLib library. The cell iterates over the entire data table and, with the g.add lines, creates RDF triples: the row ID (or some other unique identifier) becomes the triple subject, the column header becomes the predicate, and the row value becomes the object. Depending on the data, several manipulations take place at this stage, such as converting certain string values into integers to better match the ontology labels, and appending URL prefixes so that things such as Wikipedia article names can be turned into full URLs (if interested, here is an example of a more thorough cell). The last step is to output the data as a TTL file for us to then use with our Linked Data frontend.
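Stripped of provider-specific details, the idea looks roughly like the sketch below; the column names and the ontology namespace are illustrative placeholders, not the actual JVMG vocabulary:

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Illustrative namespaces; the real notebooks use provider-specific URIs.
ONT = Namespace("http://mediagraph.link/example/ontology/")
RES = Namespace("http://mediagraph.link/example/data/")
WIKIPEDIA = "https://en.wikipedia.org/wiki/"

df = pd.read_csv("vn_with_headers.tsv", sep="\t", dtype=str)

g = Graph()
g.bind("ont", ONT)

for _, row in df.iterrows():
    # The row ID becomes the triple subject...
    subject = RES[row["id"]]
    g.add((subject, RDF.type, ONT.VisualNovel))

    # ...the column header becomes the predicate, and the row value the object.
    g.add((subject, ONT.title, Literal(row["title"])))

    # Example manipulation: cast a string value to an integer.
    year = row.get("release_year")
    if isinstance(year, str) and year.isdigit():
        g.add((subject, ONT.releaseYear, Literal(int(year))))

    # Example manipulation: prepend a prefix so a Wikipedia article name
    # becomes a full URL.
    article = row.get("wikipedia")
    if isinstance(article, str) and article:
        g.add((subject, ONT.wikipedia, URIRef(WIKIPEDIA + article)))

# Output the graph as a TTL file for the Linked Data frontend.
g.serialize(destination="output.ttl", format="turtle")
```

Thus, the process transforms tabular data…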

Raw TSV data.

…into RDF data:

RDF output.

As introduced in the previous blog post, RDF expresses statements about resources as triples in subject–predicate–object form, which is what the data has now been transformed into. In the above screenshot, the subject can be seen on the unindented lines, e.g. <http://mediagraph.link/vndb/vns/10> in line 18. Following this is a block containing the predicate and object pairs describing that subject, so we can see that the resource represented by <http://mediagraph.link/vndb/vns/10> has the title “Narcissu”, an alias “ナルキッソス”, and a Wikipedia link of “https://en.wikipedia.org/wiki/Narcissu”.
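If you would rather inspect such output programmatically than read the Turtle by eye, RDFLib can parse a TTL file straight back in. A minimal sketch, with the file name as a hypothetical stand-in for the exported file:

```python
from rdflib import Graph, URIRef

g = Graph()
g.parse("output.ttl", format="turtle")

narcissu = URIRef("http://mediagraph.link/vndb/vns/10")

# Print every predicate-object pair describing this subject, mirroring the
# indented block shown in the screenshot (title, alias, Wikipedia link, ...).
for predicate, obj in g.predicate_objects(subject=narcissu):
    print(predicate, obj)
```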

Transforming the data into RDF allows us to load it into our web frontend, making it easily browsable. Once data from multiple providers has been identified as describing the same resource, the frontend is also a convenient way of viewing this matched, aggregate data on a single page. Just as importantly, the RDF form allows advanced queries to be run on the data using SPARQL.
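As a rough illustration of the kind of query this enables, here is a small sketch using RDFLib’s built-in SPARQL support; the predicate names come from the illustrative ontology namespace used earlier, not the actual JVMG vocabulary:

```python
from rdflib import Graph

g = Graph()
g.parse("output.ttl", format="turtle")  # hypothetical file from the export sketch

# Find every visual novel released in 2020, along with its title.
query = """
PREFIX ont: <http://mediagraph.link/example/ontology/>
SELECT ?vn ?title WHERE {
    ?vn ont:releaseYear 2020 ;
        ont:title ?title .
}
"""

for result in g.query(query):
    print(result.vn, result.title)
```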

Hopefully these posts have provided a bit of insight into how we’re working with the data. Upcoming blog posts will explore different use cases and should help illustrate the types of research this data transformation can enable.

Turning Fan-Created Data into Linked Data I: Ontology Creation

One of the primary functions of the JVMG project is to enable researchers to work with existing data in ways that are not readily supported by the data providers themselves. One way in which we are attempting to facilitate this flexible data work is through the use of Linked Data. As we are working with a diverse set of data providers, the ways in which they create, store, and serve data are similarly diverse. Some of these providers are MediaWiki sites, with data available as JSON through an API, while others are closer to searchable databases, with data stored in SQL and offered as large data dumps.

What remains constant across these data providers is our general data workflow: data must be accessed in some way, analyzed so that a suitable ontology can be created to represent it, transformed into a Linked Data format (in our case RDF), and finally made available so that researchers can work with it. To give readers an idea of what this workflow looks like and how the data is altered to meet the needs of researchers, we’ll be going over a couple of these steps in separate blog posts. Here, we’ll talk about the creation of the ontology based on how data providers describe their own data, and in a follow-up post, we’ll talk about some technical aspects of data transformation.

Creating the Ontology

Before data can be transformed into another format, we need to create an underlying Linked Data ontology, or vocabulary, based on the individual dataset. This ontology is used for things like defining terms, relationships, and constraints for a particular dataset and its properties. Each data provider already has some form of vocabulary in use, though these tend to be less formal than a typical Linked Data ontology. We often rely on them for guidance when developing our own ontologies to represent the data, though some significant deviations are often required. The intellectual process behind ontology creation may be covered in another post, but for now we’ll look at some of the technical details. Also note that while data ingestion is the first step in our workflow, little more happens there than what the introductory paragraphs describe, so it needs no detailed explanation: data is made available either through an API or as a public or private database dump, for us to ingest and work with.

RDF, or the Resource Description Framework, is a data model that, quoting Wikipedia, “is based on the idea of making statements about resources (in particular web resources) in expressions of the form subject–predicate–object, known as triples. The subject denotes the resource, and the predicate denotes traits or aspects of the resource, and expresses a relationship between the subject and the object.”

For example, if a given visual novel was released in 2020, we can make an RDF statement describing this data point with a triple like <http://mediagraph.link/example/data/visualnovel1> <http://mediagraph.link/example/ontology/releaseYear> "2020". The predicate of this triple, releaseYear, is an example of a vocabulary term that we can create so that existing data can be described as part of an ontology using RDF Schema, a way of describing ontologies in RDF.
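As a small sketch of what creating such a term might look like in practice, the following uses RDFLib and RDF Schema to declare releaseYear along with a class it describes; the URIs reuse the example namespace from above, and the labels, domain, and range are illustrative choices rather than the project’s actual definitions:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

ONT = Namespace("http://mediagraph.link/example/ontology/")

g = Graph()
g.bind("ont", ONT)

# Declare a class for visual novels...
g.add((ONT.VisualNovel, RDF.type, RDFS.Class))
g.add((ONT.VisualNovel, RDFS.label, Literal("Visual novel", lang="en")))

# ...and the releaseYear property that describes it.
g.add((ONT.releaseYear, RDF.type, RDF.Property))
g.add((ONT.releaseYear, RDFS.label, Literal("release year", lang="en")))
g.add((ONT.releaseYear, RDFS.comment,
       Literal("The year in which a visual novel was first released.", lang="en")))
g.add((ONT.releaseYear, RDFS.domain, ONT.VisualNovel))
g.add((ONT.releaseYear, RDFS.range, XSD.gYear))

# Serialize the vocabulary itself as Turtle.
print(g.serialize(format="turtle"))
```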

The creation of an ontology can be done in several ways. We generally use the Protégé tool, though if one knows the syntax well enough, even a text editor can be used to create an ontology file that other tools can interpret. Protégé is rather straightforward to use and allows for easy class and subclass creation, property commenting, and reasoning.

A screenshot of Protégé showing a list of data properties used in an ontology.

Though the creation of an ontology is generally not difficult, simply taking the labels from the original dataset and applying them as-is to an RDF vocabulary is usually not possible. As mentioned earlier, we receive the data in a variety of formats. Data in formats like SQL tables may share the same nondescript column header names for a number of different properties across several tables, so we often rely on how the provider’s data looks on the web to discover what a given set of data means and to guide the ontology creation itself, e.g. what a property should be called, or which type of data should be a subclass of another type.

A screenshot of the Visual Novel Database (VNDB); we can use information here to provide additional guidance for the ontology creation that the raw data itself may not.
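To make the problem of nondescript headers concrete, a purely hypothetical mapping from raw SQL column names to descriptive ontology property names might look like this; the table names, column names, and property names are all invented for illustration:

```python
# Hypothetical mappings from nondescript column headers to descriptive
# ontology property names; the same header can mean different things in
# different tables, so each table gets its own mapping.
COLUMN_TO_PROPERTY = {
    "vn": {            # visual novel table
        "c1": "title",
        "c2": "originalTitle",
        "c3": "releaseYear",
    },
    "staff": {         # staff table
        "c1": "name",
        "c2": "nameRomanized",
        "c3": "role",
    },
}
```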

The end result of this part of the workflow is a formal ontology, in an RDF-compatible format, that can be applied to a given dataset. Once the data itself has been transformed into RDF, which we’ll discuss in part II, and loaded into some type of web frontend, the information contained in the ontology file is used to add more context and meaning to the data through things like restrictions, comments, and labels, in a way that hopefully aligns with the intent of the original data. The ontology creation process is simple, but it requires a thorough understanding of the data to be done correctly, and it is an important component in making Linked Data easily workable and understandable for users.