Tiny Use Case 2: Can we test one of the points from Hiroki Azuma’s “Otaku: Japan’s Database Animals” with the JVMG database? Part 3: Regression analysis

Following the first part of this series, where we introduced Hiroki Azuma’s seminal book Otaku: Japan’s Database Animals and identified the point we are testing on the JVMG database (“many of the otaku characters created in recent years are connected to many characters across individual works” (p 49)), in part two we discussed the two datasets we are working with (The Visual Novel Database (VNDB) and Anime Characters Database (ACDB)) and the operationalization of our concepts on these datasets. Furthermore, we examined some key descriptive statistics, and based on what we saw, we reformulated our initial two hypotheses as follows:

    1. The portion of new characters with shared traits should increase over time.
    2. The portion of shared traits among new characters should increase over time.

In this third and final part of the series we will apply the toolkit of regression analysis to better understand the relationships in our data that we saw in part two, and hopefully get closer to testing our hypotheses. Regression analysis revolves around estimating the relationship between the dependent variable (the variable whose observed changes in value we would like to explain) and the independent variables (also called explanatory variables, since we aim to use them to explain the changes in the dependent variable’s values). If there are multiple possible relationships between the independent variables and the dependent variable, for example due to the large number of possible explanatory variables we could include in our model, the process of regression analysis involves comparing the different possible models and selecting the best performing one.

To make things simpler we will focus only on finding the relationships involving our two dependent variables, the average number of characters traits are shared with and the average number of shared traits, and our three explanatory (independent) variables: the number of characters, the average number of traits, and the year.

First, we will discuss our results for the average number of characters traits are shared with, and we will again start with the VNDB dataset.

Examining the correlation between our independent variables in the VNDB data, we can see that there is a very strong correlation between the average number of traits and the year variables. This is going to be important to keep in mind for interpreting our results, and is also the source of much of the difficulty that was encountered in the regression model building process.

                     year  num_char  num_traits_mean
year             1.000000  0.829202         0.949948
num_char         0.829202  1.000000         0.905112
num_traits_mean  0.949948  0.905112         1.000000
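
A correlation matrix like the one above can be reproduced with pandas. A minimal sketch, assuming a per-year aggregate table with the three columns shown (the values here are made up; the real ones come from the VNDB dataset):

```python
import pandas as pd

# Hypothetical per-year aggregates; column names mirror the output above
df = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015],
    "num_char": [120, 480, 900, 1100],
    "num_traits_mean": [3.1, 5.6, 8.2, 9.0],
})

# Pairwise Pearson correlations between the independent variables
corr = df[["year", "num_char", "num_traits_mean"]].corr()
print(corr)
```

A strongly increasing trend in both `num_char` and `num_traits_mean` over the years produces the kind of high off-diagonal correlations seen above.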

In order to capture the non-linear relationships between the dependent variable and the independent variables, squared terms of the latter were also included in the regression model building process. Furthermore, to account for potential interactions between the independent variables, interaction terms were also created between them, and introduced in the model selection process. As a final note on the employed methodology, for each dataset only years with at least thirty characters were used and continuous ranges of years were selected in every case.
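
The feature construction described above can be sketched as follows; the column names follow the convention visible in the summary tables below (e.g. `num_char_squared_X_num_traits_mean_squared`), while the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical yearly aggregates; the real data comes from VNDB/ACDB
df = pd.DataFrame({
    "year": [2008, 2009, 2010],
    "num_char": [400, 550, 610],
    "num_traits_mean": [6.2, 7.1, 7.8],
})

# Keep only years with at least thirty characters
df = df[df["num_char"] >= 30]

# Squared terms to capture non-linear relationships
for col in ["num_char", "num_traits_mean", "year"]:
    df[f"{col}_squared"] = df[col] ** 2

# Interaction terms between every pair of (possibly squared) variables
base = ["num_char", "num_char_squared",
        "num_traits_mean", "num_traits_mean_squared",
        "year", "year_squared"]
for i, a in enumerate(base):
    for b in base[i + 1:]:
        df[f"{a}_X_{b}"] = df[a] * df[b]
```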

We used statsmodels to build and evaluate our regression models. The summary output (and the visualization created with seaborn) below is of our best performing model. It has one explanatory variable: the interaction term between the squared value of the number of characters and the squared value of the average number of traits. The model confirms a number of common sense expectations. We would expect the average number of characters traits are shared with to increase with a growth in the number of characters, as well as with the average number of traits. However, the fact that it is the interaction term between the two squared terms of these variables that offers the best fit for our data demonstrates that the individual effects of the two variables are amplified by changes in the other.

This corresponds to our observation that the ratio of the average number of characters traits are shared with to the number of characters shows an increasing trend up until 2013, and decreases after that (see the figure in part 2). The effect of the growth/decline in number of characters is amplified by the parallel growth/decline in average number of traits. Thus the increase in the average number of characters traits are shared with outperforms the growth of the number of characters during the upward trend, and then the decrease in the average number of characters traits are shared with once again exceeds the decline in the number of characters during the downward trend phase.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.984
Model:                                     OLS   Adj. R-squared (uncentered):              0.983
Method:                          Least Squares   F-statistic:                              1174.
Date:                         Tue, 22 Sep 2020   Prob (F-statistic):                    1.29e-25
Time:                                 10:30:58   Log-Likelihood:                         -142.06
No. Observations:                           31   AIC:                                      286.1
Df Residuals:                               30   BIC:                                      287.5
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
==============================================================================================================
                                                 coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
num_char_squared_X_num_traits_mean_squared  3.813e-08   1.11e-09     34.271      0.000    3.59e-08    4.03e-08
==============================================================================
Omnibus:                        0.036   Durbin-Watson:                   1.196
Prob(Omnibus):                  0.982   Jarque-Bera (JB):                0.091
Skew:                           0.049   Prob(JB):                        0.956
Kurtosis:                       2.754   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

The only problem with this model for testing our hypothesis is that, since there is such a high level of correlation between the independent variables average number of traits and year, it is hard to ascertain whether any part of the effect is due to some kind of temporal change (such as an increasing trend towards relying on highly templated characters), as almost all the information for both variables is captured together by either one of the two. In order to solve this problem we will fix the average number of traits at a given level and thereby try to gauge whether the temporal variable plays any role in the relationship. Before that, however, let’s see what we find for the relationships in the ACDB data.

Starting with the data on visual novel characters from ACDB we again find that it is the interaction term between the number of characters and the average number of traits variables that proves to be the single best explanatory variable for our data. Although in this case it is not the interaction term between the squared terms of these two variables that provides the best fitting model, it is still noteworthy that the same effect, changes in one variable being amplified by changes in the other, is replicated for this dataset as well.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.983
Model:                                     OLS   Adj. R-squared (uncentered):              0.982
Method:                          Least Squares   F-statistic:                              780.6
Date:                         Tue, 22 Sep 2020   Prob (F-statistic):                    4.29e-18
Time:                                 11:58:42   Log-Likelihood:                         -92.816
No. Observations:                           22   AIC:                                      187.6
Df Residuals:                               21   BIC:                                      188.7
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
==============================================================================================
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
num_char_X_num_traits_mean     0.0058      0.000     27.940      0.000       0.005       0.006
==============================================================================
Omnibus:                        4.374   Durbin-Watson:                   1.793
Prob(Omnibus):                  0.112   Jarque-Bera (JB):                2.466
Skew:                           0.742   Prob(JB):                        0.291
Kurtosis:                       3.699   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

Examining the regression model with the best fit for our data on anime and other non-visual novel characters from ACDB we again find that it is the interaction term between the number of characters and the average number of traits variables that proves to be the single best explanatory variable. What is more, the fit of this model is only slightly worse than what we found in the case of the data on visual novel characters in the ACDB data, which further reinforces the notion that this is indeed the best explanatory relationship underpinning the changes in the average number of characters traits are shared with in both ACDB datasets.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.962
Model:                                     OLS   Adj. R-squared (uncentered):              0.961
Method:                          Least Squares   F-statistic:                              838.2
Date:                         Tue, 22 Sep 2020   Prob (F-statistic):                    2.77e-30
Time:                                 12:19:25   Log-Likelihood:                         -169.20
No. Observations:                           45   AIC:                                      340.4
Df Residuals:                               44   BIC:                                      342.2
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
==============================================================================================
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
num_char_X_num_traits_mean     0.0033      0.000     28.952      0.000       0.003       0.004
==============================================================================
Omnibus:                        4.974   Durbin-Watson:                   1.289
Prob(Omnibus):                  0.083   Jarque-Bera (JB):                5.354
Skew:                           0.235   Prob(JB):                       0.0688
Kurtosis:                       4.623   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

Fixing the average number of traits at a given level

Based on our regression analyses so far, all results point towards the number of characters and the average number of traits being the only two independent variables responsible for the changes in the average number of characters traits are shared with. If this is true, then there is no relationship with the change in time strong enough to warrant accepting the hypothesis we set up in relation to Azuma’s claim, namely that the portion of new characters with shared traits should increase over time.

However, we also know that there is a high level of correlation between the independent variables average number of traits and year in the VNDB dataset, so it is hard to ascertain how much of the relationship between the average number of characters traits are shared with and the composite independent variable (the interaction term between the squared value of the number of characters and the squared value of the average number of traits) results from the change in the average number of traits, and how much from a potential temporal effect. As already mentioned above, in order to solve this problem we will fix the average number of traits at a given level, thereby eliminating the effect of changes in the average number of traits (since it will be a constant value), and in this way attempt to capture the potential effect the temporal variable plays in the relationship. To decide on the level of average number of traits to work with, we chose the value shared by the most characters, as this provides us with the most data to analyse. Thus, we chose the average number of traits to be eight, filtered our VNDB dataset to include only these characters, and redid all the calculations for this subset of the data.
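
The selection of the modal trait count can be sketched as below; the column names and sample values are hypothetical stand-ins for the character-level VNDB table:

```python
import pandas as pd

# Hypothetical character-level table; the real one comes from VNDB
chars = pd.DataFrame({
    "char_id":    [1, 2, 3, 4, 5, 6],
    "year":       [2010, 2010, 2011, 2011, 2012, 2012],
    "num_traits": [8, 8, 5, 8, 8, 12],
})

# Pick the trait count shared by the most characters...
most_common = chars["num_traits"].value_counts().idxmax()

# ...and keep only those characters, so the number of traits is constant
subset = chars[chars["num_traits"] == most_common]
```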

The regression results of our best performing model for this scenario show two important results. First, this time the single explanatory variable in our model is the interaction term between the squared value of the number of characters and the year variable. This implies that there is indeed a positive relationship between the average number of characters traits are shared with and the time variable, which is no longer covered up by the effect of the average number of traits variable. Second, and unfortunately, even though this was the single best regression model, its fit and explanatory power are significantly below those of the models we saw previously. This is always a red flag, as it can indicate the presence of hidden variables that are in fact responsible for the variance in the values of the dependent variable.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.710
Model:                                     OLS   Adj. R-squared (uncentered):              0.696
Method:                          Least Squares   F-statistic:                              28.85
Date:                         Tue, 22 Sep 2020   Prob (F-statistic):                    2.51e-05
Time:                                 11:47:12   Log-Likelihood:                          3.6231
No. Observations:                           22   AIC:                                     -5.246
Df Residuals:                               21   BIC:                                     -4.155
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
num_char_squared_X_year  7.175e-07   1.34e-07      5.372      0.000    4.56e-07    9.79e-07
==============================================================================
Omnibus:                        0.003   Durbin-Watson:                   1.068
Prob(Omnibus):                  0.999   Jarque-Bera (JB):                0.179
Skew:                          -0.011   Prob(JB):                        0.914
Kurtosis:                       2.558   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

In order to get a better sense of whether we should be worried about this lower level of fit or not, let us take a look at what happens if we fix the average number of traits for our two ACDB datasets, and redo all our calculations. Again, choosing the average number of traits with the highest number of corresponding characters in the ACDB datasets yields seven as the average number of traits to use for this analysis.

First, let us take a look at the results for the best regression model for visual novel characters from ACDB. Not only is the number of characters the single explanatory variable in this model, the fit is even higher than what we saw previously for this dataset. This strongly confirms that our previous model featuring the interaction term between the number of characters and the average number of traits was correct: now that we have removed the effect of the variance in average number of traits from our data, the only remaining explanatory variable is exactly the number of characters, and the fit of the new model on our subset of characters with an average number of traits equal to seven proves even better than what we saw before.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.991
Model:                                     OLS   Adj. R-squared (uncentered):              0.991
Method:                          Least Squares   F-statistic:                              1861.
Date:                         Fri, 02 Oct 2020   Prob (F-statistic):                    2.01e-20
Time:                                 13:04:35   Log-Likelihood:                         -34.810
No. Observations:                           20   AIC:                                      71.62
Df Residuals:                               19   BIC:                                      72.62
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
num_char       0.0251      0.001     43.139      0.000       0.024       0.026
==============================================================================
Omnibus:                        0.125   Durbin-Watson:                   1.604
Prob(Omnibus):                  0.939   Jarque-Bera (JB):                0.056
Skew:                          -0.066   Prob(JB):                        0.972
Kurtosis:                       2.777   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

Second, examining the best performing model for the data on anime and other non-visual novel characters from ACDB (with only characters that have an average number of traits value of seven) we find that for the first time so far we have a model with two explanatory variables. The first of these is again the number of characters, while the second is an interaction term between the number of characters and the year variables. The model has a very good fit, almost as good as our previous model’s fit was on the anime and other non-visual novel characters from ACDB. We might be glad to have isolated an effect similar to the one in the VNDB data, namely that temporal change affects the average number of characters traits are shared with; however, if we examine the coefficients, we see that the relationship is in fact the inverse of our expectation. This model captures an effect in which the progression of time decreases the positive effect of the number of characters variable on the values of the average number of characters traits are shared with variable.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.950
Model:                                     OLS   Adj. R-squared (uncentered):              0.948
Method:                          Least Squares   F-statistic:                              250.3
Date:                         Fri, 02 Oct 2020   Prob (F-statistic):                    1.05e-23
Time:                                 11:21:31   Log-Likelihood:                         -136.86
No. Observations:                           43   AIC:                                      277.7
Df Residuals:                               41   BIC:                                      281.2
Df Model:                                    2                                                  
Covariance Type:                           HC3                                                  
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
num_char            0.0661      0.008      8.224      0.000       0.050       0.082
num_char_X_year    -0.0011      0.000     -4.780      0.000      -0.002      -0.001
==============================================================================
Omnibus:                        7.315   Durbin-Watson:                   1.722
Prob(Omnibus):                  0.026   Jarque-Bera (JB):                6.098
Skew:                           0.803   Prob(JB):                       0.0474
Kurtosis:                       3.907   Cond. No.                         187.
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

To visualize the relationship between three variables we need to move to a three-dimensional representation. The animated gif image below was created based on the tutorial by Eric Kim.

Summarizing what we have learned from fixing the value of the average number of traits variable, we saw that: (1) for the VNDB data we had a not so well fitting model that corresponded to our hypothesis’ expectation of a positive relationship between the average number of characters traits are shared with and the passage of time; (2) for the ACDB visual novel characters we had an almost perfect fit where the only explanatory variable was the number of characters; and finally (3) for the ACDB anime and other non-visual novel characters we found a well fitting model that indicated a minor negative relationship between temporal change and the average number of characters traits are shared with. Based on these results, on the one hand it would be difficult to come to any conclusive result regarding our hypothesis, and on the other hand it seems well warranted to take another look at our VNDB data, since the ACDB datasets behaved so differently after fixing the value of the average number of traits variable.

Adding further explanatory variables to our model

What could be the reason for this difference in the behavior of our datasets, and for the difference in the results regarding the significance and effect of the time variable? As explained in part two, one of the differences between the VNDB and ACDB datasets is their handling of character traits. Since ACDB traits are made up of a fixed and a variable part, while no VNDB traits are fixed in the same way, it is safe to assume that the structure of traits in the latter dataset could have an effect on the average number of characters traits are shared with. Furthermore, since we have already ascertained in relation to the average number of traits variable that there is a clear change over time in the attention that works and characters received in both datasets, it is not unlikely that a similar change over time could be present in the structure of traits used to describe characters in the VNDB data.

To try and capture this information about the change in the structure of traits describing characters in the VNDB data, we created a table of the frequency of all traits by year. Then we calculated the standard deviation of the distribution of traits for each year and added this information as a new independent variable to the data. Although this variable does not capture changes such as certain traits or groups of traits coming to the fore or receding over the years, it does account for the concentration of the traits that occur in a given year. The higher the standard deviation for a given year, the more the trait occurrences for that year are concentrated among a small number of traits. And our intuition is that with fewer traits occurring more commonly, it is likely easier to find characters with higher numbers of shared traits.
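
The per-year trait-frequency standard deviation can be computed roughly as below; the long-format table and trait names are hypothetical stand-ins for the full VNDB trait data:

```python
import pandas as pd

# Hypothetical (year, trait) occurrence rows; the real table covers
# all VNDB traits by year
rows = pd.DataFrame({
    "year":  [2000] * 6 + [2018] * 6,
    "trait": ["blond", "blond", "blond", "blond", "shy", "tall",
              "blond", "shy", "tall", "glasses", "twin", "kind"],
})

# Frequency of every trait per year, then the spread of those frequencies:
# a high standard deviation means a few traits dominate that year
freq = rows.groupby("year")["trait"].value_counts()
trait_std = freq.groupby("year").std()
```

In this toy data the year 2000 is dominated by a single trait (high spread), while 2018 has a perfectly even distribution (zero spread).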

In the table below we have the standard deviation of the trait distributions and the number of different traits for each year (for characters with eight traits). The standard deviation, of course, already takes into account the number of different traits, but it is worth taking a look at the top and bottom values of the table to get a better sense of what is happening in the data. For example, let us look at the years 2000 and 2018, which both have trait counts somewhere around 360, and yet the standard deviation for 2000 is significantly higher than for 2018, even though the latter has the lower number of traits, which, all other things being equal, would mean that it should have a slightly higher standard deviation. Comparing the standard deviation values for the later years in this way to earlier years with similar trait counts, we see this pattern repeating. What this means regarding the structure of the trait distributions is that in the later years there is a more even spread of traits, which based on our expectations should have an impact on the average number of characters traits are shared with.

year   traits distr. std. dev.   no. of traits
1998   3.174968                  193
1999   3.140945                  186
2000   6.050863                  369
2001   5.235086                  314
2002   5.544881                  389
2003   8.235311                  427
2004   8.326114                  414
2005   8.218588                  405
2006   7.182349                  395
2007   7.617247                  429
2008   6.676748                  439
2009   7.550598                  461
2010   6.858286                  424
2011   6.489878                  428
2012   7.651332                  432
2013   7.897157                  403
2014   7.491977                  416
2015   6.167307                  416
2016   4.979518                  501
2017   4.371030                  411
2018   3.188420                  351
2019   1.907479                  234

As a side note, it would be possible for very low standard deviation values to correspond to a perfect scenario enabling the maximum possible average number of characters traits are shared with. For example, if all characters for a given year shared the same eight traits, the standard deviation would be zero, and yet the average number of characters traits are shared with would be equal to its maximum possible value: the number of characters for that year minus one. However, considering the large number of traits for each year, it is safe to assume that the interpretation provided above holds for our dataset.
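
This degenerate case can be checked with a quick sketch (the character count is an arbitrary illustrative value):

```python
import statistics

# Hypothetical year with 50 characters, each carrying the same 8 traits
num_char = 50
trait_counts = [num_char] * 8        # every trait occurs exactly 50 times

# No spread at all among the trait frequencies
std_dev = statistics.pstdev(trait_counts)   # 0.0

# Yet each character shares traits with every other character,
# the maximum possible value for that year
chars_shared_with = num_char - 1            # 49
```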

One more important point to note here before turning to our regression analysis results is that based on the above table we now know that it is not only the amount of attention paid to works and characters that changes over time (as we saw in part 2) in the VNDB data, but also the structure of the way details are filled in about the characters in the database. Or alternatively, if the VNDB data is a good representation of what is happening on the production side of the visual novel world, the above discussed changes in the distribution of the traits for each year could signal a change in the structure of how new characters share traits among each other, a point we will return to in part four of this series.

Our model evaluation starts getting trickier now that we have included the standard deviation of the trait distribution in our array of possible explanatory variables. There is no single best performing model; rather we have two models competing with each other. First, we have the single best performing explanatory variable by itself. This variable is the interaction term between the squared value of the standard deviation of the trait distribution and the year variable, which is, in other words, the interaction term between the variance of the trait distribution and the year variable (the standard deviation being the square root of the variance). This model has a significantly better fit than our previous model, which is always a good sign, and what is more, it again indicates the role of the time variable in accordance with our expectations based on our hypothesis.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.859
Model:                                     OLS   Adj. R-squared (uncentered):              0.852
Method:                          Least Squares   F-statistic:                              113.5
Date:                         Sat, 03 Oct 2020   Prob (F-statistic):                    6.31e-10
Time:                                 15:21:21   Log-Likelihood:                          11.555
No. Observations:                           22   AIC:                                     -21.11
Df Residuals:                               21   BIC:                                     -20.02
Df Model:                                    1                                                  
Covariance Type:                           HC3                                                  
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
traits_distr_stdDev_squared_X_year 0.0007    6.2e-05     10.655      0.000       0.001       0.001
==============================================================================
Omnibus:                        1.481   Durbin-Watson:                   1.068
Prob(Omnibus):                  0.477   Jarque-Bera (JB):                1.164
Skew:                           0.353   Prob(JB):                        0.559
Kurtosis:                       2.122   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

However, as indicated above, there is another contender for the best performing regression model in this case. This second model features four explanatory variables, which is a very strong red flag, especially with so few data points and such high levels of multicollinearity among the independent variables (which just means that the information captured by one explanatory variable is also replicated by combinations of the other explanatory variables, making them redundant in a way). In fact, this model is most likely a case of overfitting, where we build a model that is overly specific to the data points we have, but which would most likely start performing poorly should we get our hands on further data points. Nevertheless, the model fit is quite good, and even according to the statistical measures that penalize the inclusion of more variables in the model (such as AIC and BIC), this model is clearly better than the previous one. So, instead of discounting this model straight away, let us see what it tells us about the role of the temporal variable in the changes in the average number of characters traits are shared with.

Out of the four explanatory variables (namely (a) the standard deviation of the trait distribution, (b) the variance of the trait distribution, (c) the interaction term between the variance of the trait distribution and the year variable, and (d) the interaction term between the number of characters and the year variable), two are interaction terms with the year variable, however, with opposite signs. What does this mean then for the effect of the passage of time on our dependent variable? Well, looking at the coefficients of the two interaction term variables, we can see that the magnitude of the effect of the interaction term between the variance of the trait distribution and the year variable is nine times that of the interaction term between the number of characters and the year variable. Taking the signs of the two explanatory variables into account, this means that if the number of characters is more than nine times the variance of the trait distribution, moving forward in time will result in a decrease in the average number of characters traits are shared with; otherwise it will have a positive impact on the same number. So, the effect of the passage of time on our dependent variable is not a definite positive or negative effect, like we saw in the previous models, but is rather an ‘it depends’.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.950
Model:                                     OLS   Adj. R-squared (uncentered):              0.939
Method:                          Least Squares   F-statistic:                              125.3
Date:                         Sat, 03 Oct 2020   Prob (F-statistic):                    6.99e-13
Time:                                 14:49:46   Log-Likelihood:                          23.065
No. Observations:                           22   AIC:                                     -38.13
Df Residuals:                               18   BIC:                                     -33.77
Df Model:                                    4                                                  
Covariance Type:                           HC3                                                  
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
traits_distr_stdDev_squared       -0.0170      0.004     -4.169      0.000      -0.025      -0.009
num_char_X_year                   -0.0002   4.06e-05     -5.936      0.000      -0.000      -0.000
traits_distr_stdDev_squared_X_year 0.0018      0.000     11.246      0.000       0.001       0.002
traitsNonZeroSE                    0.1093      0.030      3.679      0.000       0.051       0.168
==============================================================================
Omnibus:                        0.310   Durbin-Watson:                   1.668
Prob(Omnibus):                  0.856   Jarque-Bera (JB):                0.480
Skew:                           0.118   Prob(JB):                        0.787
Kurtosis:                       2.316   Cond. No.                     3.07e+03
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 3.07e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fixing the number of characters at a given level as well

Even though the previous model, with its four explanatory variables, is probably a case of overfitting, it still helps us further our examination. What if we were to fix the number of characters in our data as well, as we did with the number of traits? Examining the VNDB data for characters with eight traits, we found that for every year between 1998 and 2019 there are more than fifty such characters. Thus, if we take a random sample of fifty characters for each year and redo our calculations on that dataset, we obtain a subset of the original data in which both the number of characters and the average number of traits are constant.
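The per-year sampling step can be sketched with pandas. The character table below is a made-up stand-in for the VNDB characters with exactly eight traits; only the fifty-per-year sampling logic is the point here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 100 characters (with eight traits each) per year
chars = pd.DataFrame({
    "year": np.repeat(np.arange(1998, 2020), 100),
    "char_id": np.arange(22 * 100),
})

# Draw fifty characters per year, so the yearly character count is constant
sample = chars.groupby("year", group_keys=False).sample(n=50, random_state=0)
print(sample["year"].value_counts().unique())  # [50]
```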

The best performing model in this case features two explanatory variables: the standard deviation of the trait distribution and its squared term. Thus, by eliminating the variation in the number of characters from our data, the temporal dimension no longer appears in the model. It is also worth noting that the model fit here is slightly worse than that of the worst performing model so far, which we saw when we first fixed the number of traits. One potential explanation is that the data is far more subject to the effects of random variation, since the averages are computed from a significantly lower character count; we could test for this in the future by checking multiple samples. Of course, the possibility of further latent variables that we have not included in our lineup of potential explanatory variables is again a potential alternative to the simple presence of random noise that cannot be accounted for.

                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     numCharWithSharedTrait_mean   R-squared (uncentered):                   0.718
Model:                                     OLS   Adj. R-squared (uncentered):              0.689
Method:                          Least Squares   F-statistic:                              13.00
Date:                         Mon, 05 Oct 2020   Prob (F-statistic):                    0.000242
Time:                                 11:24:55   Log-Likelihood:                          23.954
No. Observations:                           22   AIC:                                     -43.91
Df Residuals:                               20   BIC:                                     -41.73
Df Model:                                    2                                                  
Covariance Type:                           HC3                                                  
======================================================================================================
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
traits_distribution_stdDev_squared     0.0786      0.030      2.624      0.009       0.020       0.137
traits_distribution_stdDev            -0.1449      0.067     -2.171      0.030      -0.276      -0.014
==============================================================================
Omnibus:                        0.527   Durbin-Watson:                   1.216
Prob(Omnibus):                  0.768   Jarque-Bera (JB):                0.309
Skew:                          -0.276   Prob(JB):                        0.857
Kurtosis:                       2.820   Cond. No.                         19.8
==============================================================================

Warnings:
[1] Standard Errors are heteroscedasticity robust (HC3)

Since the two explanatory variables of the model are the same variable raised to the first and second power, we can plot the model as a quadratic function of the standard deviation of the trait distribution.
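As a sketch, using the coefficients from the table above, the fitted curve is a parabola in the standard deviation; since the squared term is positive and the linear term negative, the predicted value first falls and then rises, with its minimum at -b_lin / (2 * b_sq).

```python
import numpy as np

# Coefficients of the two-variable model above (squared term, linear term)
b_sq, b_lin = 0.0786, -0.1449

x = np.linspace(0.0, 5.0, 200)   # stdDev of the trait distribution
y = b_sq * x**2 + b_lin * x      # fitted values (uncentered model, no intercept)

# Vertex of the parabola: the stdDev value where the prediction is lowest
x_min = -b_lin / (2 * b_sq)
print(round(x_min, 2))  # 0.92
```

Plotting `y` against `x` (e.g. with matplotlib) reproduces the quadratic curve described in the text.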

Conclusion

We found that our first hypothesis (the portion of new characters with shared traits should increase over time) has not been substantiated by our regression analyses. It is important to note that we cannot fully rule out that the hypothesis might still be correct (maybe our data is too skewed, or our analysis too faulty); however, based on our results thus far, we feel quite confident that it is time to seriously consider the alternative: namely, that the portion of new characters with shared traits does not increase over time, and might even demonstrate a slight opposite trend, as we saw in the case of the ACDB data for non-visual novel characters. This will be the subject of the next and final part in this blogpost series.

As a final note, what happened with our second hypothesis (the portion of shared traits among new characters should increase over time)? Interestingly enough, no models could be found that adequately account for the changes in the average number of shared traits in any of the three datasets, even when introducing the standard deviation of the trait distribution as a variable for the VNDB data. This result is definitely interesting, and will have to be the subject of further analysis in the future.
