Following the success of our project launch workshop in July 2019, the work on processing community databases started in earnest (you can read about the technical details of the process in relation to ontology creation and data transformation). By November 2019, we were ready to start examining the data and our infrastructure through the lens of exploratory research.
We decided to adopt the Tiny Use Case workflow methodology to have a number of short-term research projects that would be substantial enough to generate meaningful and interesting results in their own right, but would be compact enough to provide an ongoing stream of feedback on issues with the database, the project infrastructure, and researcher needs. Since each Tiny Use Case is only 3-4 months long, it provides us with an excellent tool for assessing our progress and for uncovering newer issues, as each TUC has a different focus and somewhat different requirements.
After taking a six-month break due to an internship, I restarted my work as a student assistant for the Japanese Visual Media Graph project in April 2020.
Currently, my main task is data quality control. Now that we have received a large amount of data from different fan communities, its quality needs to be checked against other sources to make sure that no errors are carried over into the project’s database. To get started, we decided to first check several small data samples from different providers, which makes it easier to estimate the duration, effort, and likely problems and results of a wider data quality check.
I received the first two data samples from two different fan sites, each containing twenty anime entries with several properties for me to check. These properties were, for example, the Japanese and English titles of the anime, the producing studio’s name in Japanese and English, the release date of the first episode, and the overall episode count or the completeness of a series. Which properties appeared in a sample depended on which properties the fan community it came from actually uses, and my task was to check whether the entries were correct or contained errors. To verify a value, I had to find a source of ground truth for it, which occasionally proved to be a challenge. Of course, the anime on DVD would be the best source of ground truth, but since the resources for this simply didn’t exist, I relied on other sources. A valid source of ground truth would be, for example, the opening or ending sequence of an anime, preferably found on YouTube or a legal streaming source like Netflix or Crunchyroll. An image of the DVD case of an anime would also be a usable source of ground truth.
I worked with the data in an Excel sheet, marking each property as correct or incorrect and adding screenshots and links to my sources of ground truth.
How easy it was to find ground truth differed noticeably between properties. Finding proof for the Japanese or English anime titles almost never posed a problem; they could usually be found in the opening sequences or on DVD cases. The year of first release could usually also be seen in the opening or ending sequences. The exact date, however, was sometimes quite difficult to prove. While I tried to use the Japanese Amazon Prime at first, it proved not to be reliable enough. Most of the time, I could only fall back on the Media Arts Database to find proof for an exact release date.
The names of the producing studios could usually be found easily in the opening and ending sequences of the respective anime; however, I sometimes encountered the problem that the studio didn’t write its own name in katakana, as it appeared in the fan-based data. While I could usually validate that it was indeed the correct studio, the spelling couldn’t be found in any official source of ground truth. I always marked those occurrences as “correct but strange” and left them open for further decisions.
A complicated property was the completeness of an anime. Whether or not this is a usable or provable property remains to be discussed.
After having checked those first two data samples, I can state that there was a lot of correct data, but also errors of different types. How to deal with them is also currently a point of discussion. The next samples from new sources will surely bring even more new experiences and insights.
In a previous post, we discussed the creation of a Linked Data ontology that can be used to describe existing fan-created data that the JVMG is working with. For the ontology to work correctly, the data itself must also be converted into a Linked Data format, and so in this post we’ll be discussing the transformation of data, as it’s received from providers, into RDF.
To summarize, our workflow involves using Python and the RDFLib library inside a set of Jupyter notebooks to transform and export the data from all of the data provider partners. Data ingestion is also sometimes done using Python and Jupyter notebooks, but here we’ll just focus on the data transformation.
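To give a rough idea of what such a transformation produces, here is a minimal sketch that maps one flat record to RDF triples in N-Triples syntax. It deliberately uses only the Python standard library rather than RDFLib, so it is not the project's actual code; the namespace, the property names (title, episodeCount), the Anime class, and the sample record are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch: turn one provider record into RDF triples (N-Triples syntax).
# All URIs, field names, and values below are hypothetical placeholders,
# not the ontology or data actually used by the project.

BASE = "http://example.org/jvmg/"  # assumed namespace for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def escape_literal(text):
    """Escape the characters that N-Triples string literals require."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Map a flat provider record (a dict) to a list of N-Triples lines."""
    subject = f"<{BASE}work/{record['id']}>"
    # Every record becomes a resource typed with a (hypothetical) Anime class.
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}ont/Anime> ."]
    # Titles become language-tagged literals, one triple per language.
    for lang, title in record.get("titles", {}).items():
        lines.append(f'{subject} <{BASE}ont/title> "{escape_literal(title)}"@{lang} .')
    # The episode count becomes a typed integer literal.
    if "episodes" in record:
        lines.append(
            f'{subject} <{BASE}ont/episodeCount> "{int(record["episodes"])}"^^<{XSD_INTEGER}> .'
        )
    return lines


sample = {"id": "123", "titles": {"en": "Example Title", "ja": "サンプル"}, "episodes": "12"}
ntriples = record_to_ntriples(sample)
print("\n".join(ntriples))
```

In practice a library such as RDFLib takes care of escaping, datatypes, and serialization to several formats, which is one reason to prefer it over hand-written string formatting like this.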
One of the primary functions of the JVMG project is to enable researchers to work with existing data in ways that are not readily enabled by the data providers themselves. One way in which we are attempting to facilitate this flexible data work is through the use of Linked Data. As we are working with a diverse set of data providers, the ways in which they create, store, and serve data are similarly diverse. Some of these providers are MediaWiki pages, with data being available as JSON through the use of an API, while others are closer to searchable databases, with data existing as SQL and being offered in large data dumps.
What remains constant across these data providers is our general data workflow: data must be accessed in some way, analyzed so that a suitable ontology can be created to represent it, transformed into a Linked Data format (in our case RDF), and finally made available so that researchers can work with it. To give readers an idea of what this workflow looks like and how the data we work with is altered to help it meet the needs of researchers, we’ll be going over a couple of these steps in separate blog posts. Here, we’ll talk about the creation of the ontology based on how data providers describe their own data, and in a follow-up post, we’ll talk about some technical aspects of data transformation.
Taking its inspiration from agile software development principles, the Tiny Use Case workflow was created to handle the needs of a complex research project that required the meshing of expertise from very different disciplinary backgrounds and involved a high level of uncertainty regarding the types of challenges that would emerge in the course of the project. By working through a series of three- to four-month-long Tiny Use Cases, the diggr team was able to leverage a similar cycle of continuous incremental innovations and assessments that is one of the main strengths of agile approaches.
UPDATE 2020. 03. 14.: Due to the situation regarding COVID-19, both of our upcoming conference appearances have been postponed. The Mechademia Conference will take place next year, and the Building Bridges Symposium will be held on an as-yet-undecided future date.
We will be introducing our first results at the following upcoming conferences and workshops. If you are interested in talking with a team member about our project, please feel free to contact us.
Our project will be present at the MAGIC workshop in conjunction with the ICADL2019 conference in Kuala Lumpur.
At MAGIC (Information Commons for Manga, Anime and video Games) we wish to discuss issues for creating and sharing information about MAG contents and resources. A primary topic is metadata for MAG. Metadata covers a broad range of information: from descriptions of a content to a vocabulary for organizing MAG resources, and from metadata creation to search and access.
The July workshop we discussed in our last post has now concluded, and we’re pleased to report that the event was quite successful! After introducing our own JVMG project, and the diggr project from Leipzig University, several of the invited community members gave presentations about their own sites and experiences. Over the two days of the workshop, a lot of insightful discussions took place, and we think both the project and community members were able to contribute and receive information that will be helpful, either to their research projects or their community efforts.
Regarding the JVMG project, we received a lot of useful feedback from the invited community members, which will be important in allowing us to clarify how we communicate our project to other sites and interested parties in the future. In addition, all participants seemed interested in working with us on the project in various ways. In the near future, this will take the form of some data sharing agreements made between our project and various fan communities, allowing us to begin the project in earnest by collecting and analyzing a significant amount of aggregated, community-created data.
We will be holding our first workshop in cooperation with the diggr project at the Leipzig University Library on July 2-3, 2019. Participants include community-driven initiatives and fansites from North America and Europe, along with more data-focused initiatives.
The first day of the workshop is open to the public; the schedule is as follows: