Tiny Use Case 2: Can we test one of the points from Hiroki Azuma’s “Otaku: Japan’s Database Animals” with the JVMG database? Part 2: Descriptive statistics

In the first part of this series we introduced Hiroki Azuma’s seminal book Otaku: Japan’s Database Animals, and identified a point to try and test on the JVMG database, namely that “many of the otaku characters created in recent years are connected to many characters across individual works” (p 49). This led to the formulation of the following two hypotheses.

    1. The number of new characters with shared traits should increase over time.
    2. The number of shared traits among new characters should increase over time.

How do we go about actually testing these hypotheses on the available data? Well, we would need to be able to somehow assign appearance dates to each character, otherwise we won’t be able to look at changes over time, and we also have to be able to define what characters with shared traits mean in the context of our data. So let’s take a look at the data we have to work with.

For this TUC we decided to use the data from The Visual Novel Database (VNDB) and Anime Characters Database (ACDB), as both databases contain a significant number of characters and describe them with a relatively large number of detailed traits. There are, however, important differences between the two datasets. VNDB focuses only on visual novels, whereas ACDB collects data on a wide range of characters from various media (although predominantly focusing on visual novels and anime). Furthermore, VNDB has a very rich and rigorously structured ontology of traits that is nevertheless open to extension by users, but it lacks a core set of featured traits that would be expected to be available for all characters. In contrast, ACDB features a hybrid system for describing characters: on the one hand it supports a closed ontology for eight flagship traits that are part of each character’s fact sheet, and on the other hand it allows free-form tagging of characters with user-created labels.

The differences between the two datasets also become apparent when we try to assign a date of first appearance to the characters. In VNDB, characters are linked to one or more visual novels in which they appear, each visual novel has multiple releases, and each release has a release date. Thus, to find the first appearance of each character in the dataset we looked up all releases of all visual novels that feature the given character, and compared the release dates to find the earliest one. In ACDB, on the other hand, each character has one work that they are featured in, and this work has a date value. It is this date that we assigned to characters in the ACDB dataset as their date of first appearance. The dates assigned in this way are probably more reliable as actual first appearance dates for the more detailed VNDB dataset than for the ACDB data. However, with more than 100,000 characters in the latter dataset, and assuming that potential inconsistencies are random in nature, we can assume that these problems should even themselves out and not impact our analysis. Note that we could also have tried to use outside sources to assign first appearance dates to the characters; however, the very reason we are working with these databases is the assumption that they are potentially the best available source of data on these domains of interest.
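The VNDB lookup described above can be illustrated with a minimal sketch. The data layout (two plain dictionaries), the identifiers and the dates are hypothetical and chosen only for illustration; they do not reflect the actual VNDB schema or the JVMG tooling.

```python
from datetime import date


def first_appearance(character_vns, vn_releases):
    """Return {character_id: earliest release date}.

    character_vns: {character_id: [visual_novel_id, ...]}   (assumed layout)
    vn_releases:   {visual_novel_id: [release_date, ...]}   (assumed layout)
    Characters without any dated release are left out of the result.
    """
    result = {}
    for char_id, vn_ids in character_vns.items():
        # Collect the release dates of every visual novel featuring the character.
        dates = [d for vn_id in vn_ids for d in vn_releases.get(vn_id, [])]
        if dates:
            result[char_id] = min(dates)
    return result


# Hypothetical toy data.
character_vns = {"c1": ["v1", "v2"], "c2": ["v2"], "c3": ["v3"]}
vn_releases = {
    "v1": [date(2004, 4, 28), date(2006, 1, 1)],
    "v2": [date(2002, 8, 9)],
    "v3": [],  # no dated release, so c3 gets no first appearance date
}
first = first_appearance(character_vns, vn_releases)
```

For ACDB the equivalent step would be a direct lookup of the single date attached to the character's work, so no minimum needs to be taken.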

Although defining shared traits is quite straightforward in the sense that if two characters both have the same trait then it is considered a shared trait for those two characters, there is still the question of which elements of the datasets to include as traits for our investigation. This question is ultimately related to what we want to capture with our research. So, going back to our original hypotheses derived from Azuma’s work, we are hoping to see an increase in the number of characters that share not just one trait among them, but most likely a group of traits, as in the Ayanami Rei example discussed by Azuma. Furthermore, these groups of traits should be at least somewhat specific. For this reason we have opted to (a) exclude gender, as it is a trait that is almost always going to be shared among characters, and is therefore just noise for our research purposes. (Note that we could just as well run all of the following analysis with the gender trait included for both datasets, but apart from shifting the individual values, it would not impact the change over time we are interested in.) We also decided to (b) include the character tags for the ACDB data, as they relate to traits that people feel strongly enough about to want to add information on, and are thus most likely essential to what makes a character a representative of a given character template.

The final and most important question regarding the operationalization of our research question is the way we define characters with shared traits. Now, this might seem surprising, since we already established above that shared traits are those traits that two characters have in common. However, if we were to simply consider all characters which share traits with any other character in the database to be characters with shared traits, we would most likely end up with the majority of characters sharing traits with other characters, without any meaningful structure to capture the change in character creation/consumption practices described by Azuma. Again, thinking about the actual meaning of what Azuma is trying to say, we can think of the increase in the number of characters with shared traits as a result of some popular trait combination appearing and that group of traits being replicated in other works that follow. Thus among new characters that appear within a given time frame, we should expect to find characters that share groups of traits. This can happen by chance, of course, but on a larger scale it potentially results from this form of replication effect. For ease of analysis let’s define the time frame for which we will count characters with shared traits as a one-year window, which means we will only compare the traits of characters whose first appearance falls in the same year. This will allow us to compare the change over time in the emergence of new characters with shared traits.

Another decision we need to make is how to best capture the effect of trait groups being shared between characters. We have already decided to remove gender from among the traits we are considering, as that would lead to almost all characters sharing at least one trait with some other characters. By introducing a threshold for the number of traits two characters have to share for us to consider them characters with shared traits, we can try to filter out the noise of characters sharing traits out of pure chance, and hopefully better capture the effect of the rise in shared groups of traits among characters. Although we would need to conduct more detailed measurements to ascertain that this cut-off point is the most productive in this regard, for the time being we will go with a minimum of five shared traits for us to consider two characters to have shared traits.
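As a rough illustration of this definition, the sketch below compares characters only within the same year of first appearance and keeps a pair only if the two characters share at least five traits. The input layout, the function name and the toy traits are hypothetical. The pairwise comparison is quadratic in the number of characters per year, so this is meant to clarify the logic rather than to describe how the analysis was actually run.

```python
from collections import defaultdict
from itertools import combinations

MIN_SHARED = 5  # threshold chosen in the text


def shared_trait_partners(characters, min_shared=MIN_SHARED):
    """Return {char_id: {partner_id: number_of_shared_traits}}.

    characters: {char_id: (year_of_first_appearance, set_of_traits)}  (assumed layout)
    Only pairs from the same year with at least `min_shared` common traits count.
    """
    by_year = defaultdict(list)
    for char_id, (year, _traits) in characters.items():
        by_year[year].append(char_id)

    partners = {char_id: {} for char_id in characters}
    for ids in by_year.values():
        for a, b in combinations(ids, 2):
            n_shared = len(characters[a][1] & characters[b][1])
            if n_shared >= min_shared:
                partners[a][b] = n_shared
                partners[b][a] = n_shared
    return partners


# Hypothetical toy data; gender is assumed to be removed from the trait sets already.
characters = {
    "a": (2000, {"t1", "t2", "t3", "t4", "t5", "t6"}),
    "b": (2000, {"t1", "t2", "t3", "t4", "t5"}),
    "c": (2000, {"t1", "t2", "t3", "t4"}),              # shares only four traits
    "d": (2001, {"t1", "t2", "t3", "t4", "t5", "t6"}),  # different year
}
partners = shared_trait_partners(characters)
```

In this toy example only "a" and "b" count as characters with shared traits: "c" falls below the threshold, and "d" is never compared with the others because it first appears in a different year.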

Furthermore, in order to allow for a better comparison between the VNDB and ACDB data, in the case of the ACDB dataset we will also distinguish between visual novels and other works, and compare characters for shared traits only within their respective groups (even though creative influence can and does travel between the realms of visual novels and other types of works). Since Azuma’s book is about Japanese works, we have also decided to disregard characters in the ACDB dataset that belong to media types that are clearly non-Japanese (e.g. ‘Western animation’) or that have no media type information.
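The grouping and filtering of ACDB characters could look roughly like the following sketch. The record layout and the media type labels are assumptions made for illustration: only ‘Western animation’ is named in the text, while "Visual novel" and "Anime" are hypothetical labels that may not match the actual ACDB values.

```python
# Media types treated as clearly non-Japanese. Only 'Western animation' is
# mentioned in the text; any further entries would have to be added by hand.
NON_JAPANESE_MEDIA = {"Western animation"}

# Hypothetical label for visual novels in the ACDB media type field.
VISUAL_NOVEL_MEDIA = {"Visual novel"}


def group_acdb_characters(characters):
    """Split characters into visual novels and other works.

    characters: iterable of dicts with 'id' and 'media_type' keys (assumed layout).
    Characters without media type information or with a non-Japanese media
    type are dropped.
    """
    groups = {"visual_novels": [], "other_works": []}
    for character in characters:
        media = character.get("media_type")
        if not media or media in NON_JAPANESE_MEDIA:
            continue
        key = "visual_novels" if media in VISUAL_NOVEL_MEDIA else "other_works"
        groups[key].append(character["id"])
    return groups


# Hypothetical toy records.
acdb_characters = [
    {"id": "a", "media_type": "Visual novel"},
    {"id": "b", "media_type": "Anime"},
    {"id": "c", "media_type": "Western animation"},
    {"id": "d", "media_type": None},
    {"id": "e"},
]
groups = group_acdb_characters(acdb_characters)
```

Shared traits would then be counted separately within each of the two resulting groups.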

One final note on methodology: we have excluded traits relating to sexual activity from the VNDB dataset, as those would unnecessarily inflate shared trait counts, and would thus hinder the comparison with characters from other types of works.

In order to develop an understanding of the data let’s start by examining some summary statistics. First let’s take a look at the VNDB data. The graph below contains information on (a) the number of characters by year (top left), (b) the average number of traits per character for each year (bottom left), (c) the average number of characters from the same year that each character shares at least five traits with (top right), and finally, (d) the average number of shared traits among characters from the same year with at least five shared traits (bottom right).
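To make the four measures concrete, here is a minimal sketch of how they could be computed per year. It rests on two assumptions that the text does not spell out: measure (c) is averaged over all characters of the year (including those with no partner), and measure (d) is averaged over the qualifying pairs. The input layout and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def yearly_summary(characters, min_shared=5):
    """Return {year: {...}} with the four summary measures (a) to (d).

    characters: {char_id: (year_of_first_appearance, set_of_traits)}  (assumed layout)
    """
    by_year = defaultdict(list)
    for char_id, (year, _traits) in characters.items():
        by_year[year].append(char_id)

    summary = {}
    for year, ids in sorted(by_year.items()):
        partner_count = {char_id: 0 for char_id in ids}
        shared_counts = []  # one entry per qualifying pair
        for a, b in combinations(ids, 2):
            n_shared = len(characters[a][1] & characters[b][1])
            if n_shared >= min_shared:
                partner_count[a] += 1
                partner_count[b] += 1
                shared_counts.append(n_shared)
        summary[year] = {
            "n_characters": len(ids),                                      # (a)
            "avg_traits": mean([len(characters[c][1]) for c in ids]),      # (b)
            "avg_partners": mean(list(partner_count.values())),            # (c)
            "avg_shared": mean(shared_counts) if shared_counts else None,  # (d)
        }
    return summary


# Hypothetical toy data.
characters = {
    "a": (2000, {"t1", "t2", "t3", "t4", "t5", "t6"}),
    "b": (2000, {"t1", "t2", "t3", "t4", "t5"}),
    "c": (2000, {"t1", "t2", "t3", "t4"}),
    "d": (2001, {"t1", "t2", "t3", "t4", "t5", "t6"}),
}
summary = yearly_summary(characters)
```

Because only pairs with at least five shared traits enter measure (d), its value can never fall below five, and it is undefined (here `None`) for years without any qualifying pair.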

The most striking point in relation to these four graphs is probably the similarity in their general shapes, with the exception of the bottom right one (which is in fact also similar, as will be explained shortly). Let’s take a look at the top two graphs first. We can see that the number of characters recorded in the VNDB database for each year shows a mostly growing tendency until it plateaus between 2011 and 2014 and then goes into a steady decline. The fact that the average number of characters that a character shares at least five traits with seems to follow this same pattern raises two important questions. First, we do indeed find that the number of characters with shared traits increases over time, which would seem to support Azuma’s claim; however, what are we to make of the drop-off in the number of characters with shared traits after 2013? Second, what if the increase in the number of characters with shared traits is simply a result of there being more characters overall, which would also help explain the drop-off in their numbers post-2013? This second point also raises an issue with the way we originally formulated our hypotheses, and we should reformulate them to take this possibility into account:

    1. The proportion of new characters with shared traits should increase over time.
    2. The proportion of shared traits among new characters should increase over time.

Next, let’s examine the bottom left graph of the average number of traits by year. We can again clearly see a mostly rising tendency until a plateau is reached for 2011-2018, followed by a drop-off in 2019 and 2020. It is worth noting here that the snapshot of the data we are using in the JVMG project is from early 2020, thus it would seem logical that information on 2020, and potentially 2019 too, is not yet as complete as for earlier years. This line of reasoning could plausibly account for the drop-off in the average number of traits recorded for characters, but it is very curious to see the number of characters recorded per year in the database starting to decline after 2014. Was there maybe a shock to the industry that led to fewer works being produced? Or is there maybe a considerable time lag in Japanese visual novels reaching Western audiences, who are ultimately responsible for compiling the data on VNDB?

Finally, examining the average number of shared traits for characters with shared traits (bottom right graph), we can see that the figures are very close to five. This is a result of our definition of characters with shared traits requiring a minimum of five shared traits, which obviously makes five the lowest possible value for the average number of shared traits among characters with shared traits. Another very prominent feature of the graph is the steep drop until around 2000, after which the trend reverses and we see a more gradual growth leading up to 2019. If we take a look at the top right graph, we can see that the average number of characters with shared traits is rather small until 1998, which would mean that the average shared trait values are probably skewed by a few instances of characters sharing a larger number of traits, which also happen to be the only characters with shared traits. Once we move to the post-2000 part of the graph the values are averages over much larger character counts, and start to mirror the growth in the average number of traits (bottom left graph). This is again not surprising: with a larger number of traits being recorded for characters, the possibility of higher numbers of shared traits also increases.

Now that we have a general overview of our data, and have already identified some of the trends that are interesting for our present research question, let’s examine two relationships: on the one hand, the relationship between the growth in the number of characters and the average number of characters that characters share at least five traits with, and on the other hand, the relationship between the average number of traits per year and the average number of shared traits for characters with shared traits. Although we will be conducting a more complex analysis of these relationships in the third part of this blog post series, for now let’s take a look at a very simple approach to examining these two relationships.

Examining the change in the ratio of the average number of characters that characters share at least five traits with (abbreviated as “characters with shared traits” in the following) to the number of all characters on the top graph, we can see that up until 2013 the growth of characters with shared traits outpaces the increase in characters. Thus we could surmise that it cannot simply be the growth in the number of characters that is responsible for the increasing number of characters with shared traits. However, it is curious that once the number of characters per year starts to drop, the ratio at hand also starts to decline, indicating that the number of characters with shared traits decreases even faster, which seems to point to some other factor potentially being at play here.

The ratio of the average number of shared traits for characters with shared traits (abbreviated as “shared traits” in the following) to the average number of traits per year in the bottom graph also displays a clear trend, especially if we concentrate on the section starting from 1998, as explained above. What we can see in this case is that the growth in the average number of traits consistently outpaces the increase in shared traits. This would seem to point towards our second hypothesis not standing up to the test of the actual data. However, it is important to keep in mind that the growth in the detail of character traits recorded in the database does not necessarily reflect the actual change in the level of detail that characters are created with.
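The two ratios discussed here are simple per-year quotients, as the following sketch shows. The dictionary keys and all numbers are hypothetical and purely illustrative; they are not values from VNDB or ACDB.

```python
def yearly_ratios(summary):
    """Return the two per-year ratios discussed in the text.

    summary: {year: {"n_characters", "avg_partners", "avg_traits", "avg_shared"}}
             (assumed layout; "avg_shared" may be None for years without any
             pair of characters reaching the shared trait threshold)
    """
    ratios = {}
    for year in sorted(summary):
        row = summary[year]
        # Top graph: characters with shared traits relative to all characters.
        if row["n_characters"]:
            partner_ratio = row["avg_partners"] / row["n_characters"]
        else:
            partner_ratio = None
        # Bottom graph: shared traits relative to the average number of traits.
        if row["avg_shared"] is None or not row["avg_traits"]:
            trait_ratio = None
        else:
            trait_ratio = row["avg_shared"] / row["avg_traits"]
        ratios[year] = {"partner_ratio": partner_ratio, "trait_ratio": trait_ratio}
    return ratios


# Hypothetical toy values.
summary = {
    1999: {"n_characters": 100, "avg_partners": 1.0, "avg_traits": 8.0, "avg_shared": 6.0},
    2000: {"n_characters": 200, "avg_partners": 4.0, "avg_traits": 10.0, "avg_shared": 5.5},
    2001: {"n_characters": 1, "avg_partners": 0.0, "avg_traits": 6.0, "avg_shared": None},
}
ratios = yearly_ratios(summary)
```

In the toy data the first ratio rises from 1999 to 2000 even though the number of characters doubles, which is the kind of pattern that would suggest growth beyond a pure size effect. The second ratio falls, because the average number of traits grows while the average number of shared traits does not keep up.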

Having found some interesting trends in the VNDB data, let’s now compare them to what we find in the ACDB dataset, and more specifically let’s see how the trends compare to those found for the visual novels and the other works in the ACDB data. The following graph contains the same subgraphs as the corresponding graph for VNDB above, with separate trend lines plotted for visual novels (yellow) and other works (blue). Note that for the sake of comparability the graphs start from 1993, as there was no continuous data on visual novels prior to that in the ACDB dataset.

Right away we can see that the change in the number of characters in the ACDB data both for visual novels and for other works seems to follow a similar pattern to what we saw for the VNDB data. For visual novels the top plateau is between around 2007-2012, and for other works between around 2004-2016 (we could of course draw stricter boundaries in both cases). Even without checking any further sources, we know from the VNDB data that the number of characters in visual novels should not start to decline before 2014 (the absolute numbers are also higher in the VNDB database). So, maybe it is not just the actual number of characters that impacts the shape of these graphs, but also the amount of interest that members of the communities building the databases are willing to afford the works, characters and their documentation in these databases. In part three we will try to untangle this effect from the other potential causes for why and how the number and ratio of characters with shared traits changes over time.

Examining the bottom left graph on the average number of traits, the impact of the attention being paid to characters and the work invested in populating the database with data once again becomes very clear. It is hard to imagine that there would be such a gap in the level of detail between characters that appear in visual novels and those that are featured in other works. Yet the graph demonstrates a very distinct difference between the two plot lines, with a significantly higher average trait count for visual novel characters all the way up to 2015, even though the number of characters is of a similar magnitude, and often even roughly equivalent, for the two groups. Also note the drop-off in the average number of traits mirroring the progression of the recorded number of characters. Both of these phenomena point towards the significance of how invested contributors are in building the database and its records.

Leading on from the higher average number of traits for visual novel characters, it is no surprise that in the middle section of the top right graph we again find a higher average count of characters that characters share at least five traits with for visual novels. The early dominance of other works on the graph up to around 2000 is probably a result of the significantly higher number of recorded characters compared to visual novels, as can be seen on the top left graph. Although the plot line for visual novels on this top right graph, similar to what we saw in the VNDB data, once again seems to roughly follow the one found on the graph for the number of characters, this correspondence seems to be far less pronounced for other works.

Finally, looking at the bottom right graph on average number of shared traits it would be hard to discern any distinct trend visible to the naked eye (as was the case for the VNDB data) for either of the two plot lines, other than the clear drop-off at the end coinciding with the decreasing number of recorded traits.

Let us now look at the same ratios we have already examined for the VNDB data.

It’s hard to say for sure, but it is almost as if there was no increasing or decreasing trend for most of both of the plot lines in the top graph. We would need to fit actual regressions to answer whether this is in fact the case, which is precisely what we will be doing in part three. For now it is still interesting to note that even though some of the data in the ACDB dataset had a somewhat similar distribution to what we saw in the case of VNDB, the very obvious trend in the ratio of the average number of characters that characters share at least five traits with to the number of all characters does not seem to be replicated here. The lower graph, on the other hand, is perhaps somewhat more similar to what we saw in the case of VNDB.

To better understand these differences and similarities, and to try and untangle the different potential effects influencing the changes in these ratios, we will turn to the toolkit of regression analysis in the upcoming third part of this series on TUC 2.