The NER and computer vision phases result in a list of location names mentioned in the photo captions or found by the visual place recognition algorithm. However, as already discussed, this information will frequently be too vague to support a good location estimate, or the NER and computer vision results will conflict. For example, “Gare du Nord” does not tell us much without additional location information: is it the station in Brussels or the one in Paris? And what does “Cinquantenaire” refer to, a name with several Wikipedia pages? To find the answer, we need to link each of these locations to a standardized entry in a database and then choose the one that is most likely the place where the picture was taken. This process of automatically linking words in a document (e.g. place names) to an external database (e.g. Wikipedia) is called “entity linking”, “named entity disambiguation”, “reconciliation”, or “entity resolution”. Although much progress has been made since the seminal works, the task is far from solved. Its difficulty varies with the quality and length of the text: a tweet or an SMS, for example, can be ambiguous even for a human, and the same goes for short photo captions. The temporal context also matters. When the caption of a picture taken in the 1940s says that the scene took place in “Hamme, Belgium”, we must not forget that, at the time, at least three Belgian municipalities bore this name, in three different provinces.
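The candidate-retrieval side of entity linking can be sketched as a SPARQL query against Wikidata that asks for all entities carrying a given label. The query shape below is a minimal, illustrative sketch (a production system would also match aliases, filter by entity type, and handle multiple languages); the function name is ours, not part of any existing library.

```python
# Sketch of candidate retrieval for entity linking: given an ambiguous
# place name, build a SPARQL query asking Wikidata for every entity with
# that label, together with the country it belongs to (property P17).

def build_candidate_query(place_name: str, lang: str = "fr") -> str:
    """Return a SPARQL query listing Wikidata items labelled `place_name`."""
    return f"""
    SELECT ?item ?itemLabel ?countryLabel WHERE {{
      ?item rdfs:label "{place_name}"@{lang} .
      OPTIONAL {{ ?item wdt:P17 ?country . }}   # P17 = country
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

query = build_candidate_query("Gare du Nord")
```

Sent to the public endpoint at https://query.wikidata.org/sparql, a query of this kind returns several candidates for “Gare du Nord”, including stations in both Paris and Brussels, which is exactly the ambiguity the disambiguation step must then resolve.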

The proposed disambiguation method uses original clues from the photo database, i.e., the thesaurus keywords that archivists have applied to folders containing groups of photos related to the same theme. These keywords sometimes contain a place name, such as a city, a province, or a region. Using these terms, parsed out from the keywords, we apply an algorithm that queries Wikidata[1] with SPARQL[2] queries and API calls. For each location previously extracted from the photo captions and the computer vision output, the algorithm selects the possible candidates and chooses the most likely one based on the available clues.
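The clue-based selection step can be illustrated as follows. The data structures, field names, and scoring rule here are our own simplifying assumptions, not the paper’s actual implementation: each candidate carries the administrative units it belongs to, and the candidate sharing the most units with the folder keywords wins.

```python
# Hypothetical sketch of clue-based candidate selection: score each
# Wikidata candidate by how many of its administrative units (city,
# province, country...) appear among the clues parsed from the
# archivists' thesaurus keywords, and keep the best-supported one.

def pick_candidate(candidates, clues):
    """Return the candidate whose admin hierarchy overlaps most with the clues."""
    def score(cand):
        return len(set(cand["admin_units"]) & set(clues))
    best = max(candidates, key=score)
    # Refuse to guess when no clue supports even the best candidate.
    return best if score(best) > 0 else None

candidates = [
    {"label": "Gare du Nord (Paris)", "admin_units": {"Paris", "France"}},
    {"label": "Gare de Bruxelles-Nord", "admin_units": {"Brussels", "Belgium"}},
]
chosen = pick_candidate(candidates, clues={"Belgium", "Brabant"})
```

With the folder keyword “Belgium” as a clue, the Brussels station is selected; with no overlapping clue, the function returns `None`, mirroring the system’s policy of deferring to human verification rather than guessing.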

When multiple places are mentioned for the same picture, the algorithm uses the same clues to rule out the least likely ones. The computer vision results are also used for disambiguation. In the case of “Cinquantenaire”, which can refer to a park, a museum, or a bridge, the classes predicted by the computer vision model (e.g. “Bridge”) help determine the most appropriate entry. When the computer vision predictions seem to contradict each other (for example, if the place could be a church, a synagogue, or a palace at the same time), Wikidata’s ontology is used to find the broader category that subsumes these classes (in this case, the superclass “Building”). Finally, if the ambiguity is too strong, the system does not try to guess further and simply indicates that the text or photo must be verified by the crowd. The first results, obtained on a sample of pictures, are encouraging; they must now be tested on a larger scale and later subjected to human evaluation.
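The superclass step above can be sketched with a toy subclass hierarchy. In Wikidata the hierarchy is expressed through the “subclass of” property (P279); the tiny hand-made dictionary below is a stand-in for the real ontology, and the function names are ours.

```python
# Toy sketch of superclass resolution: when computer-vision labels
# conflict (church vs. synagogue vs. palace), walk a subclass-of
# hierarchy upward until a class is found that subsumes them all.
# SUBCLASS_OF mimics Wikidata's P279 ("subclass of") relation.

SUBCLASS_OF = {
    "church": "place of worship",
    "synagogue": "place of worship",
    "place of worship": "building",
    "palace": "building",
    "building": None,
}

def ancestors(cls):
    """Return the chain from a class up to the hierarchy's root."""
    chain = [cls]
    while SUBCLASS_OF.get(cls):
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def common_superclass(classes):
    """Return the nearest class present in every candidate's ancestor chain."""
    chains = [ancestors(c) for c in classes]
    for cls in chains[0]:
        if all(cls in chain for chain in chains[1:]):
            return cls
    return None

result = common_superclass(["church", "synagogue", "palace"])  # "building"
```

For “church”, “synagogue”, and “palace” the walk converges on “building”, matching the example in the text; for “church” and “synagogue” alone it stops earlier, at “place of worship”, keeping the resolved class as specific as the evidence allows.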


[1]    https://www.wikidata.org – a structured counterpart of Wikipedia, readable by both humans and machines

[2]    https://www.w3.org/TR/rdf-sparql-query/