

Debugging Wikipedia and DBPedia


This page presents a method to debug Wikipedia and DBPedia, based on pseudo-key computation over the DBPedia dataset.



While experimenting with RDF_keys, we ran the algorithm on various datasets, including DBPedia. We relaxed the notion of key to also consider pseudo-keys, so as to cope with data containing errors. We realized that pseudo-keys highlight suspicious data in the dataset: they show that a very small number of resources share the same values for a property set, while all other instances have different values for that property set. Inspecting these data revealed errors in DBPedia, in Wikipedia, or even in the data sources on which the Wikipedia articles are based.


Computing pseudo-keys

We apply the RDF key computation algorithm (RDF_keys#Algorithm) to the DBPedia dataset. Keys were computed for 242 classes of the dataset, with the minimum support set to 0.1 and the discriminability threshold set to 0.99. Results are available for download as RDF/N3, in the format introduced in RDF_keys#Example: here.

Here are a few keys for the class Person having a discriminability greater than 0.999 but strictly lower than 1.

http://dbpedia.org/ontology/deathDate http://dbpedia.org/ontology/birthDate
http://dbpedia.org/ontology/deathDate http://dbpedia.org/ontology/deathPlace
http://xmlns.com/foaf/0.1/name http://dbpedia.org/ontology/birthPlace
http://xmlns.com/foaf/0.1/surname http://purl.org/dc/elements/1.1/description
http://dbpedia.org/ontology/deathPlace http://dbpedia.org/ontology/birthDate

While not impossible, it is statistically rare for two persons to be born on the same date and to die on the same date. It would be even more unlikely for these persons to have also died at the same place, as the second key might suggest.
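
The idea of a pseudo-key can be illustrated on toy data. The sketch below is only an illustration: it assumes that discriminability is the share of instances whose value combination for the property set is unique, which may differ from the definition given in RDF_keys. The function names and the `ex:` resources are hypothetical and are not taken from DBPedia.

```python
from collections import defaultdict


def shared_value_groups(instances, properties):
    """Group instance names by their tuple of values for `properties`.

    Instances lacking one of the properties are skipped.
    """
    groups = defaultdict(list)
    for name, description in instances.items():
        if all(p in description for p in properties):
            signature = tuple(description[p] for p in properties)
            groups[signature].append(name)
    return groups


def discriminability(instances, properties):
    """ASSUMED measure: share of covered instances with a unique signature."""
    groups = shared_value_groups(instances, properties)
    covered = sum(len(members) for members in groups.values())
    unique = sum(1 for members in groups.values() if len(members) == 1)
    return unique / covered if covered else 0.0


def suspicious_groups(instances, properties):
    """Return the groups of instances that share all values for `properties`."""
    groups = shared_value_groups(instances, properties)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical toy data: ex:A_dup duplicates ex:A.
people = {
    "ex:A": {"birthDate": "1900-01-01", "deathDate": "1970-05-05"},
    "ex:A_dup": {"birthDate": "1900-01-01", "deathDate": "1970-05-05"},
    "ex:B": {"birthDate": "1910-02-02", "deathDate": "1980-06-06"},
    "ex:C": {"birthDate": "1920-03-03", "deathDate": "1990-07-07"},
}
key = ["birthDate", "deathDate"]
score = discriminability(people, key)
suspects = suspicious_groups(people, key)
```

On a real class the score of a pseudo-key would be close to 1, and the few groups returned by `suspicious_groups` are the resources worth inspecting.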

Querying DBPedia to retrieve potentially erroneous data

We transform each key into a SPARQL query that selects pairs of distinct resources of a given class having the same values for all the properties of the key. For the class DBPedia:Person and the first key given above, we obtain the following query.

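PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

# Pairs of distinct persons sharing both their death date and their birth date
SELECT DISTINCT ?x ?y
WHERE {
  ?x dbpedia-owl:deathDate ?dd ;
      dbpedia-owl:birthDate ?bd ;
      rdf:type dbpedia-owl:Person .
  ?y dbpedia-owl:deathDate ?dd ;
      dbpedia-owl:birthDate ?bd ;
      rdf:type dbpedia-owl:Person .
  FILTER (?x != ?y)
}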

This query can be executed on the DBPedia SPARQL endpoint by following this link.
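
The transformation from a key to a query is mechanical, so it can be sketched in a few lines of Python. This is a minimal sketch and not the tool used to produce the results on this page; the function names `key_to_sparql` and `_pattern` and the variable names `?v0`, `?v1`, ... are assumptions made for illustration. Full IRIs are written between angle brackets so that no prefix declaration is needed.

```python
def _pattern(subject, class_uri, key_properties):
    """Triple patterns binding one resource to the shared key values."""
    parts = [f"<{prop}> ?v{i}" for i, prop in enumerate(key_properties)]
    parts.append(f"a <{class_uri}>")
    return f"  ?{subject} " + " ;\n      ".join(parts) + " ."


def key_to_sparql(class_uri, key_properties):
    """Build a query returning pairs of distinct resources of `class_uri`
    that have the same value for every property of the key."""
    if not key_properties:
        raise ValueError("a key needs at least one property")
    lines = [
        "SELECT DISTINCT ?x ?y",
        "WHERE {",
        _pattern("x", class_uri, key_properties),
        _pattern("y", class_uri, key_properties),
        "  FILTER (?x != ?y)",
        "}",
    ]
    return "\n".join(lines)


# First key listed above for the class Person.
query = key_to_sparql(
    "http://dbpedia.org/ontology/Person",
    [
        "http://dbpedia.org/ontology/deathDate",
        "http://dbpedia.org/ontology/birthDate",
    ],
)
print(query)
```

Both patterns reuse the same `?v` variables, which is what forces the two resources to agree on every property of the key.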

Validating the results

All 125 pairs of resources returned by the query indicate errors, either in DBPedia, in Wikipedia, or in the data sources used to write the Wikipedia article.

  • The most frequent error arises when two resources describe the same object, for example dbpedia:Louis_IX_of_France__Saint_Louis___1 and dbpedia:Louis_IX_of_France. This error points to a problem in Wikipedia, where two distinct pages exist for the same person.
  • A second type of error comes from the DBPedia extraction process. Some infoboxes are not correctly extracted, leading to the misclassification of the corresponding resources. For instance, dbpedia:Timeline_of_the_presidency_of_John_F._Kennedy is of type dbpedia-owl:Person although it in fact refers to a timeline.
  • A third type of error comes from the Wikipedia pages themselves, or from the documents used to write the articles. See for example the Wikipedia page corresponding to [1]

Going Further

Here is a short todo list:

  • Automating the generation of SPARQL queries from keys.
  • Building a debugging tool that shows the DBPedia and Wikipedia pages side by side, with direct buttons to report the bug or edit the Wikipedia page.