Experiments in entity matching

Our project has so far transformed data from multiple online enthusiast communities, as well as from other data-centric projects, into the RDF data format, and it is now available online as Linked Open Data. The next step is to match the corresponding entities across the sources and merge the properties that each source provides.

As we have learned from our data quality analysis, the title information in both English and Japanese is very reliable, and genuine factual errors are quite rare. But we did notice many typographical inconsistencies: extra or missing spaces, wrong glyphs for specific dashes, wrong types of apostrophes or punctuation marks, and similar mistakes. To eliminate this potential source of problems, one can simply drop the affected characters from the titles before comparing them. This is a well-established practice in information retrieval, where the removal of such characters is routinely employed in the “normalization” step of indexing, alongside other transformations such as case folding.
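As a rough sketch of what such a normalization step could look like (the function name and the exact set of transformations are our own illustration, not a fixed specification), one can unify Unicode glyph variants, drop everything that is not a letter or digit, and fold case:

```python
import unicodedata

def normalize_title(title: str) -> str:
    """Normalize a title for matching: unify Unicode glyph variants,
    drop punctuation and whitespace, and fold case."""
    # NFKC unifies full-width/half-width forms and many glyph variants,
    # which matters for mixed English/Japanese titles.
    text = unicodedata.normalize("NFKC", title)
    # Drop everything that is not a letter or a digit (spaces, dashes,
    # apostrophes, exclamation marks, ...).
    text = "".join(ch for ch in text if ch.isalnum())
    return text.casefold()
```

With this function, “WORKING!!”, “WORKING´!!” and “WORKING!!!” all normalize to the same string, “working”, which is exactly the ambiguity discussed below.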

But with the removal of any character comes a loss of information, and in the case of Japanese visual media, this information can be vital for distinguishing between different works. As an example, consider the TV series based on the manga with the original title “WORKING!!”.

Its first season is titled “WORKING!!”, its second “WORKING´!!”, and its third “WORKING!!!”. Removing the usual suspects from the titles creates ambiguity, as all three seasons would be identified by the normalized title “working”. But keeping them could lead to mismatches, as some sources might use a different apostrophe glyph or a different number of exclamation marks.

The good news is that by performing a normalization transformation on the title data (which includes the removal of all problematic characters), we can be quite sure that all potential matches will be found. There will, of course, be some ambiguous matches, but these are easy to identify: we simply check whether a source has multiple entries with the same normalized title. Human intervention can then find the correct matches within each ambiguous match set.
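The bucketing described above can be sketched as follows. The function name, the `(id, title)` pair representation, and the minimal inline normalization are our own illustrative assumptions, not the project's actual implementation:

```python
from collections import defaultdict

def normalize(title):
    """Minimal normalization: keep only letters/digits, fold case."""
    return "".join(ch for ch in title if ch.isalnum()).casefold()

def bucket_matches(source, anidb):
    """Classify (id, title) pairs from `source` against `anidb`.

    Returns unambiguous 1:1 matches, ambiguous match sets (for
    human review), and the source ids that remain unmatched.
    """
    index = defaultdict(list)       # normalized title -> AniDB ids
    for anidb_id, title in anidb:
        index[normalize(title)].append(anidb_id)

    grouped = defaultdict(list)     # normalized title -> source ids
    for src_id, title in source:
        grouped[normalize(title)].append(src_id)

    one_to_one, ambiguous, remaining = {}, {}, []
    for key, src_ids in grouped.items():
        candidates = index.get(key, [])
        if len(src_ids) == 1 and len(candidates) == 1:
            one_to_one[src_ids[0]] = candidates[0]
        elif candidates:
            # Several entries share this normalized title on either
            # side: flag the whole set for manual review.
            ambiguous[key] = (src_ids, candidates)
        else:
            remaining.extend(src_ids)   # no candidate at all
    return one_to_one, ambiguous, remaining
```

A match counts as 1:1 only when the normalized title is unique in both sources; any shared normalized title goes into the ambiguous bucket as a whole set.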

Here are some results from matching anime entries from several data sources using the normalized Japanese title information.

Anime Characters Database <-> AniDB

Base: 3323
1:1 matches: 2382
Ambiguous matches: 263
Remaining: 678

Animeclick.it <-> AniDB

Base: 7467
1:1 matches: 6334
Ambiguous matches: 397
Remaining: 736

“Base” is the number of anime entries in the source on the left side. AniDB has the largest number of anime entries of all our sources, so we iterate through the smaller dataset and search for a corresponding title in AniDB. We can see that 71% and 84% of the entries, respectively, can be matched unambiguously, with only a small number of ambiguous matches. In comparison, quite a few entries cannot be matched at all, which is puzzling given the original results of our data quality analysis. One reason could be a difference in coverage of the existing media.

Currently, two student assistants are manually checking the ambiguous matches, while members of the project team try to find the corresponding entities for the unmatched entries by hand. It will be interesting to find out what causes these failed matches.