Working with the Tiny Use Case workflow methodology in the JVMG project

Following the success of our project launch workshop in July 2019, work on processing community databases started in earnest (you can read about the technical details of the process in relation to ontology creation and data transformation). By November 2019, we were ready to start examining the data and our infrastructure through the lens of exploratory research.

We decided to adopt the Tiny Use Case workflow methodology in order to have a number of short-term research projects that are substantial enough to generate meaningful and interesting results in their own right, yet compact enough to provide an ongoing stream of feedback on issues with the database, the project infrastructure, and researcher needs. Since each Tiny Use Case is only 3-4 months long, it provides us with an excellent tool for assessing our progress and for uncovering new issues, as each TUC has a different focus and somewhat different requirements.

The first TUC ran from November 2019 to February 2020, the second one spanned the spring of 2020, and we are currently in the middle of Tiny Use Case three, which started in June and is scheduled to run until September. With two TUCs already completed and one very much underway, we already have a number of takeaways from these research projects. Some of these are minor but can still lead to new features, such as the implementation of the “simple” view on the front-end prototype, while others are so fundamental that they cannot be readily resolved within the span of a single TUC and become mainstay fixtures of the feedback coming out of subsequent TUCs. An example of such a takeaway is our ongoing realization of how severe the big data issues in these research projects are, and how much work handling them requires.

The wrap-up and evaluation phase of each TUC also offers a chance to momentarily pause whatever we are working on hardest, which may or may not be the right way of approaching certain problems, and look at the big picture again for some much-needed reflection. One of the ideas that has come out of these “taking a step back to reflect” moments is the need to engage with the toolkit of network analysis. The most common tools used in data analysis require tabular data as inputs, yet we are investing a lot of time and effort into moving data out of its various original tabular formats and into a unified linked data format. One of the defining characteristics of the linked data format is that we end up with a graph of all data elements. What if we could harness this aspect of the data structure, either to gain new insights specific to network analysis, or to perform certain types of analysis more efficiently? We are currently exploring the opportunities opened up by this approach, and will be reporting on our findings here on the blog in the future.
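As a very rough illustration of the idea, an RDF export could be loaded into a general-purpose network analysis library. The sketch below is only an assumption about how such an experiment might start (the file name and the choice of networkx are placeholders, not our actual tooling); it simply treats every triple as an edge between its subject and object.

```python
# A minimal sketch, assuming an RDF export in Turtle format; the file name
# and the flattening of triples into an undirected networkx graph are
# illustrative placeholders rather than our actual pipeline.
import networkx as nx
from rdflib import Graph

rdf_graph = Graph()
rdf_graph.parse("example_export.ttl", format="turtle")  # hypothetical export

nx_graph = nx.Graph()
for subj, pred, obj in rdf_graph:
    # Each triple becomes an edge between subject and object,
    # with the predicate kept as an edge attribute.
    nx_graph.add_edge(str(subj), str(obj), predicate=str(pred))

# Example measure: the ten most highly connected nodes in the data graph.
top_nodes = sorted(nx_graph.degree, key=lambda pair: pair[1], reverse=True)[:10]
print(top_nodes)
```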

Finally, thanks to the iterative nature of the TUC workflow methodology, we can also see how the growing library of data extraction and analysis pipeline elements, along with our expanding list of identified best practices, directly contributes to making each subsequent TUC easier to get off the ground and running. Not to mention that communication between the library and computer science researchers and the humanities and social science researchers becomes smoother with each new TUC, as both groups learn more about the other's concepts and tools, and an increasingly shared vocabulary emerges from our work together.

We look forward to sharing the most interesting questions and findings from each of our TUCs here on the blog, so stay tuned!

Data quality and ground truth

After a six-month break due to an internship, I restarted my work as a student assistant for the Japanese Visual Media Graph project in April 2020.

Currently, my main task is data quality control. Having received lots of data from different fan communities, we need to check the quality of that data against other sources to make sure no errors are adopted into the project's database. To get started, we decided to first check several small data samples from different providers, which makes it easier to estimate the duration, effort, and likely problems and results that a wider data quality control effort would entail.

I received the first two data samples from two different fan sites, both containing twenty anime entries with several properties to check; those properties included, for example, the Japanese and English titles of the anime, the producing studio’s name in Japanese and English, the release date of the first episode, and the overall episode count or the completeness of a series. The properties in the samples depended on which properties the respective fan communities use, and my task was to check whether the entries were correct or contained errors. To verify an entry, I had to find a source of ground truth for it, which occasionally proved to be a challenge. Of course, the anime on DVD would be the best source of ground truth, but since these resources simply weren’t available, I relied on other sources. A valid source of ground truth would, for example, be the opening or ending sequence of an anime, preferably found on YouTube or a legal streaming service like Netflix or Crunchyroll. An image of the DVD case of an anime would also be a usable source of ground truth.

I worked with the data in an Excel sheet, marking the correctness of the respective properties and adding screenshots and links to my sources of ground truth.

Excel screenshot of the first data sample.

I encountered a noticeable difference in finding ground truth for the different properties. Finding proof for the Japanese or English anime titles almost never posed a problem; they could usually be found in the opening sequences or on DVD cases. The year of first release could also usually be seen in the opening or ending sequences. The exact date, however, was sometimes quite difficult to prove. While I tried to use the Japanese Amazon Prime listings at first, they proved not reliable enough. Most of the time I had to turn to the Media Arts Database to find proof for an exact release date.

The names of the producing studios could usually be found easily inside the opening and ending sequences of the respective anime; however, I sometimes encountered the problem that the studio didn’t write its own name in katakana, as provided in the fan-created data. While I could usually validate that it was indeed the correct studio, the spelling couldn’t be found in any official source of ground truth. I always marked those occurrences as “correct but strange” and left them open for further decisions.

Example of the above-mentioned studio problem.

A complicated property was the completeness of an anime. Whether or not this is a usable or provable property remains to be discussed.

Having checked those first two data samples, I can state that there was a lot of correct data, but also errors of different types. How to deal with them is currently a point of discussion. The next samples from new sources will surely bring even more new experiences and insights.

Turning Fan-Created Data into Linked Data II: Data Transformation

In a previous post, we discussed the creation of a Linked Data ontology that can be used to describe existing fan-created data that the JVMG is working with. For the ontology to work correctly, the data itself must also be converted into a Linked Data format, and so in this post we’ll be discussing the transformation of data, as it’s received from providers, into RDF.

To summarize, our workflow involves using Python and the RDFLib library inside a set of Jupyter notebooks to transform and export the data from all of our data provider partners. Data ingestion is also sometimes done using Python and Jupyter notebooks, but here we’ll just focus on the data transformation.

Though parts of the process can change rather drastically between data providers, and even between separate sets of data from the same provider, we’ll illustrate the general process using an example from a provider that makes data available as SQL tables.

Jupyter Notebook screenshot.

Above is a Jupyter Notebook screenshot showing the first two cells of the data transformation; the first gathers the headers, which in this case are stored in a separate file, and the second applies the headers to the data itself. The result is a TSV with proper headers that we can then begin to transform into RDF.
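For illustration, here is a minimal sketch of what these two cells do, assuming a pandas-based workflow with hypothetical file names (headers.txt, data_dump.tsv) rather than the provider's actual layout:

```python
# A rough sketch of the two cells described above; the file names and the
# tab-separated, header-less layout are assumptions for the example.
import pandas as pd

# Cell 1: collect the column headers, which are stored in a separate file
with open("headers.txt", encoding="utf-8") as f:
    headers = [line.strip() for line in f if line.strip()]

# Cell 2: apply the headers to the header-less data dump
data = pd.read_csv("data_dump.tsv", sep="\t", header=None, names=headers)

# Write out a TSV with proper headers for the transformation step
data.to_csv("data_with_headers.tsv", sep="\t", index=False)
```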

Jupyter Notebook screenshot.

Above is an edited and much simplified version of the actual data transformation cell that utilizes the RDFLib library. The cell iterates over the entire data table and then, with the g.add lines, creates RDF triples using various parameters. In essence, the triples are created using the row ID (or some other unique identifier) as the triple subject, the column header as the predicate, and the row value as the object. Depending on the data, several manipulations take place at this stage, such as the transformation of certain string values into integers to better work with the ontology labels, and the appending of URL prefixes so that things such as Wikipedia article names can be turned into full URLs (if interested, here is an example of a more thorough cell). The last step here is to output the data as a TTL file that we can then use with our Linked Data frontend. Thus, the process transforms tabular data…

Raw TSV data.

…into RDF data:

RDF output.

As was introduced in the previous blog post, RDF is a set of statements about resources expressed as triples in subject–predicate–object form, which is what the data has now been transformed into. In the above screenshot, the subject can be seen on the unindented lines, e.g. <http://mediagraph.link/vndb/vns/10> in line 18. Following this is a block that contains the predicate and object pairs describing that subject, so we can see that the resource represented by <http://mediagraph.link/vndb/vns/10> is described as having the title “Narcissu”, an alias “ナルキッソス”, and a Wikipedia link of “https://en.wikipedia.org/wiki/Narcissu”.
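To make the loop described above a bit more concrete, here is a heavily simplified sketch (not the actual notebook cell); the column names, the ID scheme, and the ontology namespace are assumptions made for the example:

```python
# A simplified sketch of the transformation cell; namespaces, column names,
# and the ID scheme are illustrative, not taken from the real dataset.
import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef

VNS = Namespace("http://mediagraph.link/vndb/vns/")
ONT = Namespace("http://mediagraph.link/vndb/ont/")  # hypothetical ontology prefix

data = pd.read_csv("data_with_headers.tsv", sep="\t")

g = Graph()
g.bind("vns", VNS)
g.bind("ont", ONT)

for _, row in data.iterrows():
    subject = VNS[str(row["id"])]                        # row ID -> subject
    g.add((subject, ONT.title, Literal(row["title"])))   # header -> predicate, value -> object
    g.add((subject, ONT.alias, Literal(row["alias"])))
    if pd.notna(row["wikipedia"]):
        # Append a URL prefix so article names become full URLs
        g.add((subject, ONT.wikipedia,
               URIRef("https://en.wikipedia.org/wiki/" + str(row["wikipedia"]))))

# Export as Turtle (TTL) for the Linked Data frontend
g.serialize(destination="vns.ttl", format="turtle")
```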

This transformation into RDF allows us to load the data into our web frontend, making it easily browsable. Once data from multiple providers has been identified as describing the same resource, the frontend is also a convenient way of viewing this matched, aggregate data on a single page. Importantly, the transformation into RDF also allows advanced queries to be run on the data using SPARQL.
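As a small illustration of the kind of query this enables, the following sketch runs a SPARQL query over the generated Turtle file with RDFLib; the file name and property URIs are the hypothetical ones from the sketch above, not our production setup:

```python
# Query the transformed data with SPARQL via RDFLib (illustrative only).
from rdflib import Graph

g = Graph()
g.parse("vns.ttl", format="turtle")

query = """
PREFIX ont: <http://mediagraph.link/vndb/ont/>

SELECT ?vn ?title
WHERE {
    ?vn ont:title ?title .
    ?vn ont:wikipedia ?wiki .
}
LIMIT 10
"""

# Print the first ten visual novels that have both a title and a Wikipedia link
for vn, title in g.query(query):
    print(vn, title)
```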

Hopefully these posts have provided a bit of insight into how we’re working with the data. Some upcoming blog posts will explore different use cases, and should be helpful in providing insight as to the types of research the data transformation can enable.

Turning Fan-Created Data into Linked Data I: Ontology Creation

One of the primary functions of the JVMG project is to enable researchers to work with existing data in ways that are not readily enabled by the data providers themselves. One way in which we are attempting to facilitate this flexible data work is through the use of Linked Data. As we are working with a diverse set of data providers, the ways in which they create, store, and serve data are similarly diverse. Some of these providers are MediaWiki pages, with data being available as JSON through the use of an API, while others are closer to searchable databases, with data existing as SQL and being offered in large data dumps. 

What remains constant across these data providers is our general data workflow: data must be accessed in some way, analyzed so that a suitable ontology can be created to represent it, transformed into a Linked Data format (in our case RDF), and finally made available so that researchers can work with it. To give readers an idea of what this workflow looks like and how the data we work with is altered to meet the needs of researchers, we’ll be going over a couple of these steps in separate blog posts. Here, we’ll talk about the creation of the ontology based on how data providers describe their own data, and in a follow-up post, we’ll talk about some technical aspects of data transformation.

Creating the Ontology

Before data can be transformed into another format, we need to create an underlying Linked Data ontology, or vocabulary, based on the individual dataset. This ontology is used for things like defining terms, relationships, and constraints for a particular dataset and its properties. Each data provider already has some form of vocabulary in use, though these tend to be less formal than a typical Linked Data ontology. We often rely on them for guidance when developing our own ontologies to represent the data, though some significant deviations are often required. A look at the intellectual process behind ontology creation may be covered in another post, but for now we’ll look at some of the technical details. Also note that while data ingestion is the first step in our workflow, the introductory paragraphs essentially cover everything that takes place during that stage, so no detailed explanation is needed: data is made available either through an API or as a public or private database dump, which we ingest and then work with.

RDF, or the Resource Description Framework, is a data model that, quoting Wikipedia, “is based on the idea of making statements about resources (in particular web resources) in expressions of the form subject–predicate–object, known as triples. The subject denotes the resource, and the predicate denotes traits or aspects of the resource, and expresses a relationship between the subject and the object.”

For example, if a given visual novel was released in 2020, we can make an RDF statement describing this data point using a triple like <http://mediagraph.link/example/data/visualnovel1> <http://mediagraph.link/example/ontology/releaseYear> “2020”. The predicate of this triple – releaseYear – is an example of a vocabulary term that we can create so that existing data can be described as part of an ontology using RDF Schema, which is a way of describing ontologies in RDF.
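To illustrate, here is a minimal sketch of how such a vocabulary term could be declared with RDF Schema using RDFLib; the namespace URI and the label/comment wording are invented for the example, not taken from our actual ontology files:

```python
# Declaring a hypothetical releaseYear term with RDF Schema via RDFLib.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import XSD

ONT = Namespace("http://mediagraph.link/example/ontology/")

g = Graph()
g.bind("ont", ONT)

# releaseYear is a property with a human-readable label, a comment, and a range
g.add((ONT.releaseYear, RDF.type, RDF.Property))
g.add((ONT.releaseYear, RDFS.label, Literal("release year", lang="en")))
g.add((ONT.releaseYear, RDFS.comment,
       Literal("The year in which a visual novel was first released.", lang="en")))
g.add((ONT.releaseYear, RDFS.range, XSD.gYear))

print(g.serialize(format="turtle"))
```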

The creation of an ontology can be done in several ways. We generally use the Protégé tool to do this, though if one knows how to create the syntax properly, even a text editor can be used to create an ontology file that other tools can interpret. Protégé is rather straightforward to use and allows for easy class and subclass creation, property commenting, and reasoning. 

A screenshot of Protégé showing a list of data properties used in an ontology.

Though the creation of an ontology is generally not difficult, a simple transformation involving taking the labels from the original dataset and applying them as-is to an RDF vocabulary is not usually possible. As mentioned earlier, we receive the data in a variety of formats. Data in formats like SQL tables may share the same nondescript column header names for a number of different properties across several different tables, so we often rely on how the provider’s data looks on the web to discover what a given set of data means, and to provide guidance on the ontology creation itself, e.g. what properties should be called, or what type of data should be a subclass of another type.

A screenshot of the Visual Novel Database (VNDB). Information shown here can provide additional guidance for the ontology creation that the raw data itself may not.

This part of the workflow results in a formal ontology for a given set of data, in an RDF-compatible format, that can be applied to that dataset. Once the data itself has been transformed into RDF, which we’ll discuss in part II, and loaded into some type of web frontend, the information contained in the ontology file is used to add more context and meaning to the data through things like restrictions, comments, and labels, in a way that hopefully aligns with the intent of the original data. The ontology creation process is simple, but it requires a thorough understanding of the data in order to be done correctly, and it is an important component in making Linked Data easily workable and understandable for users.

What is a Tiny Use Case?

The term Tiny Use Case, or TUC for short, was coined by the diggr (Databased Infrastructure for Global Games Culture Research) research project team. A detailed description of this workflow methodology can be found in their paper With small steps to the big picture: A method and tool negotiation workflow (Freybe, Rämisch and Hoffmann 2019).

Taking its inspiration from agile software development principles, the Tiny Use Case workflow was created to handle the needs of a complex research project that required meshing expertise from very different disciplinary backgrounds and involved a high level of uncertainty regarding the types of challenges that would emerge in the course of the project. By working through a series of three- to four-month-long Tiny Use Cases, the diggr team was able to leverage the same cycle of continuous incremental innovation and assessment that is one of the main strengths of agile approaches.

In their description of the TUC workflow, the diggr project identified three key phases:

  1. Mediation of the research interest/object
  2. Exploring software solutions
  3. Evaluation

As they themselves put it: “The basic idea of our approach is that D [the digital expert] and H [the humanities scholar] educate each other about the respective domain specific blind spots which leads to a common understanding and a shared technical terminology.” (p. 15) Thus, in the first phase the humanities scholar provides a research question that needs to be explained to the digital expert, so that together they can translate the requirements of the project into technical terms. Then, in the next phase, the digital expert works together with the humanities scholar to find an appropriate technical solution that can deliver the right type of data for the question at hand. Finally, the evaluation step is where the successes and shortcomings of the TUC are identified, and takeaways for further TUCs are abstracted from the process.

The Tiny Use Case workflow methodology was adopted in the JVMG project precisely because of the similarities in the challenges faced by the two undertakings. Most importantly, the JVMG project also involves bridging disciplinary boundaries and developing an understanding between library and computer science on the one hand, and humanities and social science on the other. The fact that the TUC approach had already been developed and tested in a previous project was a huge help going forward for the JVMG project. It is important to emphasize, however, that this did not mean that our project was spared the kind of difficulties the diggr project encountered. Rather, it meant that we were better prepared for the types of problems that would inevitably arise, and had a clear idea that these challenges were not to be automatically taken as signs of something going wrong; quite the opposite, they are the necessary steps through which each such project needs to develop.

Presence at upcoming conferences and workshops

UPDATE 2020. 03. 14.: Due to the situation regarding COVID-19 both of our upcoming conference appearances have been postponed. The Mechademia Conference will take place next year, and the Building Bridges Symposium will be held on an as yet undecided future date.

We will be introducing our first results at the following upcoming conferences and workshops. If you are interested in talking with a team member about our project, please feel free to contact us.

ICADL2019 and MAGIC Workshop

Our project will be present at the MAGIC workshop in conjunction with the ICADL2019 conference in Kuala Lumpur.

At MAGIC – Information Commons for Manga, Anime and Video Games – we wish to discuss issues around creating and sharing information about MAG contents and resources. A primary topic is metadata for MAG. Metadata covers a broad range of information – from descriptions of content to vocabularies for organizing MAG resources, and from metadata creation to search and access.

Links: http://icadl2019.org/index.php/workshop/magic

https://mdlab.slis.tsukuba.ac.jp/magic/

Workshop Report

The July workshop we discussed in our last post has now concluded, and we’re pleased to report that the event was quite successful! After introductions of our own JVMG project and the diggr project from Leipzig University, several of the invited community members gave presentations about their own sites and experiences. Over the two days of the workshop, a lot of insightful discussion took place, and we think both the project members and the community members were able to contribute and receive information that will be helpful, whether to their research projects or their community efforts.

Regarding the JVMG project, we received a lot of useful feedback from the invited community members, which will be important in helping us clarify how we communicate about our project with other sites and interested parties in the future. In addition, all participants seemed interested in working with us on the project in various ways. In the near future, this will take the form of data sharing agreements between our project and various fan communities, allowing us to begin the project in earnest by collecting and analyzing a significant amount of aggregated, community-created data.

Finally, a big thanks to all of the community representatives from the Anime Characters Database, AnimeClick.it, Animexx, IGDB, Oregami, Stiftung Digitale Spielekultur, VNDB, and Wikidata’s Video Games Task Force for taking the time to come meet with us, and to the Leipzig team for hosting the workshop. This was an important initial milestone for us, and community involvement is vital to the foundation of our project, so we’re thankful for all of the interest and cooperation we received. We’ll be sure to update this blog as the project continues, so stay tuned!

All photos from Jean-Frédéric, Wikimedia.

Our first workshop!

We will be holding our first workshop in cooperation with the diggr project at the Leipzig University Library on July 2–3, 2019. Participants include community-driven initiatives and fan sites from North America and Europe, along with more data-focused initiatives.

The first day of the workshop is open to the public; the schedule is as follows:

10.00 – JVMG Project Group/diggr Project Group – Workshop Introduction
10.15 – diggr Project Group – Project introduction and overview, project members introductions
11.00 – JVMG Project Group – Project introduction, overview, aims and project members introductions
11.45 – Coffee break
12.00 – Stephen Goral (Anime Characters Database)
12.20 – Jean-Frédéric Berthelot (Wikidata – Video Game Taskforce)
12.40 – Lunch break
14.30 – Yoran Heling (The Visual Novel Database)
14.50 – Jerome Richer (IGDB)
15.10 – Maria Pino (Animeclick.it)
15.30 – Coffee break
15.50 – Winfried Bergmeyer (Stiftung Digitale Spielekultur)
16.05 – Marc Schuler (Animexx)
16.25 – Jens Mildner (Oregami)
16.45 – Coffee break
17.05 – Round Table Discussion

Workshop_Poster_July_2nd