One of the primary functions of the JVMG project is to enable researchers to work with existing data in ways that are not readily enabled by the data providers themselves. One way in which we are attempting to facilitate this flexible data work is through the use of Linked Data. As we are working with a diverse set of data providers, the ways in which they create, store, and serve data are similarly diverse. Some of these providers are MediaWiki pages, with data being available as JSON through the use of an API, while others are closer to searchable databases, with data existing as SQL and being offered in large data dumps.
What remains constant across these data providers is our general data workflow: data must be accessed in some way, analyzed so that a suitable ontology can be created to represent it, transformed into a Linked Data format (in our case RDF), and finally made available so that researchers can work with it. To give readers an idea of what this workflow looks like and how the data is altered to meet the needs of researchers, we’ll be going over a couple of these steps in separate blog posts. Here, we’ll talk about the creation of the ontology based on how data providers describe their own data, and in a follow-up post, we’ll talk about some technical aspects of data transformation.
Creating the Ontology
Before data can be transformed into another format, we need to create an underlying Linked Data ontology, or vocabulary, based on the individual dataset. This ontology defines the terms, relationships, and constraints for a particular dataset and its properties. Though they tend to be less formal than a typical Linked Data ontology, each data provider already has some form of vocabulary in use. We often rely on these for guidance when developing our own ontologies to represent their data, though significant deviations are often required. A look at the intellectual process behind ontology creation may be covered in another post, but for now we’ll look at some of the technical details. Also note that while data ingestion is the first step in our workflow, the introductory paragraphs above cover essentially everything that takes place during that stage, and so no detailed explanation is needed. Data is made available either through an API or as a public or private database dump, which we ingest and then work with.
RDF, or the Resource Description Framework, is a data model that, quoting Wikipedia, “is based on the idea of making statements about resources (in particular web resources) in expressions of the form subject–predicate–object, known as triples. The subject denotes the resource, and the predicate denotes traits or aspects of the resource, and expresses a relationship between the subject and the object.”
For example, if a given visual novel was released in 2020, we can make an RDF statement describing this data point using a triple like
<http://mediagraph.link/example/data/visualnovel1> <http://mediagraph.link/example/ontology/releaseYear> "2020" .
Here the subject and predicate are URIs, written in angle brackets, while the object is a literal value, written in quotes. The predicate of this triple, releaseYear, is an example of a vocabulary term that we can create so that existing data can be described as part of an ontology using RDF Schema, which is a way of describing ontologies using RDF.
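To make the shape of such a statement concrete, here is a minimal sketch that formats the example triple as a line of N-Triples using only the Python standard library. The helper names (iri, literal, triple) are hypothetical, and the xsd:gYear datatype on the literal is an assumption made for illustration rather than something our ontology necessarily uses.

```python
# Minimal sketch: format one RDF statement as an N-Triples line.
# Helper names are hypothetical; the URIs are the example URIs from the text.

XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"  # assumed datatype
BASE = "http://mediagraph.link/example"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str, datatype: str = "") -> str:
    """Quote a literal value, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    # Append the datatype IRI only when one was given.
    return f"{quoted}^^{iri(datatype)}" if datatype else quoted


def triple(subject: str, predicate: str, obj: str) -> str:
    """Join subject, predicate, and object into one terminated statement."""
    return f"{subject} {predicate} {obj} ."


statement = triple(
    iri(f"{BASE}/data/visualnovel1"),
    iri(f"{BASE}/ontology/releaseYear"),
    literal("2020", XSD_GYEAR),
)
print(statement)
```

In practice an RDF library would handle serialization and escaping; the point here is only that the subject and predicate are URIs while the object is a quoted literal.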
The creation of an ontology can be done in several ways. We generally use the Protégé tool to do this, though if one knows how to create the syntax properly, even a text editor can be used to create an ontology file that other tools can interpret. Protégé is rather straightforward to use and allows for easy class and subclass creation, property commenting, and reasoning.
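As a rough illustration of what an ontology file written by hand might contain, the sketch below assembles a very small RDF Schema vocabulary in Turtle syntax as plain text. Apart from releaseYear, the terms (Work, VisualNovel), their labels, and the comment are hypothetical examples, and the function names are our own; this is not the output of Protégé or the actual JVMG ontology.

```python
# Minimal sketch: build a tiny RDF Schema vocabulary as Turtle text,
# the same text one could type into an editor by hand.
# All terms except releaseYear are hypothetical examples.

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://mediagraph.link/example/ontology/",
}

# Each term maps to a list of (predicate, object) pairs in Turtle notation.
TERMS = {
    "Work": [("a", "rdfs:Class"), ("rdfs:label", '"Work"')],
    "VisualNovel": [
        ("a", "rdfs:Class"),
        ("rdfs:subClassOf", "ex:Work"),
        ("rdfs:label", '"Visual novel"'),
    ],
    "releaseYear": [
        ("a", "rdf:Property"),
        ("rdfs:domain", "ex:Work"),
        ("rdfs:label", '"release year"'),
        ("rdfs:comment", '"The year in which a work was first released."'),
    ],
}


def render_prefixes(prefixes):
    """Emit one @prefix declaration per namespace."""
    return "\n".join(f"@prefix {name}: <{ns}> ." for name, ns in prefixes.items())


def render_term(name, pairs):
    """Emit one subject with its predicate-object pairs separated by ';'."""
    body = " ;\n    ".join(f"{predicate} {obj}" for predicate, obj in pairs)
    return f"ex:{name} {body} ."


def render_ontology(prefixes, terms):
    """Combine the prefix header and all term descriptions."""
    blocks = [render_prefixes(prefixes)]
    blocks += [render_term(name, pairs) for name, pairs in terms.items()]
    return "\n\n".join(blocks) + "\n"


ontology_turtle = render_ontology(PREFIXES, TERMS)
print(ontology_turtle)
```

The printed text can be saved with a .ttl extension and opened in a tool that reads Turtle. The subclass statement and the label and comment annotations correspond to the class hierarchy and property commenting mentioned above.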
Though the creation of an ontology is generally not difficult, simply taking the labels from the original dataset and applying them as-is to an RDF vocabulary is not usually possible. As mentioned earlier, we receive the data in a variety of formats. Data in formats like SQL tables may reuse the same nondescript column names for a number of different properties across several different tables, and so we often rely on how the provider’s data looks on the web to discover what a given set of data means, and to guide the ontology creation itself, e.g. what properties should be called, or which type of data should be a subclass of another.
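The column-name problem described above can be sketched as an explicit mapping from a (table, column) pair to a descriptive ontology term. The table names, column names, and property names below are all hypothetical and only illustrate why the same column label cannot simply be reused as a vocabulary term.

```python
# Minimal sketch: map nondescript SQL column names to descriptive
# ontology terms. All table, column, and property names are hypothetical.

ONTOLOGY_NS = "http://mediagraph.link/example/ontology/"

# The same column name ("name", "date") means different things per table,
# so the lookup key has to include the table.
COLUMN_TO_PROPERTY = {
    ("releases", "name"): "releaseTitle",
    ("releases", "date"): "releaseDate",
    ("characters", "name"): "characterName",
    ("characters", "date"): "characterBirthday",
}


def property_iri(table: str, column: str) -> str:
    """Return the ontology property IRI chosen for a table column."""
    try:
        term = COLUMN_TO_PROPERTY[(table, column)]
    except KeyError:
        # Fail loudly so unmapped columns are reviewed rather than guessed.
        raise KeyError(f"No ontology term mapped for {table}.{column}") from None
    return ONTOLOGY_NS + term


print(property_iri("releases", "name"))
print(property_iri("characters", "name"))
```

Raising an error for unmapped columns is a deliberate choice in this sketch: a column whose meaning has not been checked against the provider’s website should not silently receive a guessed property name.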
The end result of this part of the workflow is a formal ontology, in an RDF-compatible format, that can be applied to a given dataset. Once the data itself has been transformed into RDF, which we’ll discuss in part II, and loaded into some type of web frontend, the information contained in the ontology file is used to add more context and meaning to the data through things like restrictions, comments, and labels, in a way that hopefully aligns with the intent of the original data. The ontology creation process is simple, but it requires a thorough understanding of the data in order to be done correctly, and it is an important component in making Linked Data easy for users to work with and understand.