Named entity disambiguation (ULB, Resic Department)

Take the sentence “Dallas, it’s 9 a.m.: I’m walking down the street where it happened, 55 years ago…” Can you guess where the action takes place? You would probably say Dallas, Texas, and that “it happened” refers to the assassination of President Kennedy. If this seems crystal clear to us, it is only because we humans are very good at resolving the ambiguities of natural language.

For a machine, however, “Dallas” could just as well be a person’s name, and the scene a dialogue. Even if we tell the computer that Dallas is a place name [1], a question remains: which Dallas are we talking about? According to Wikipedia, at least twenty localities around the world share this name. Without sufficient clues (that is, clues a machine can understand), the computer will probably choose the most popular candidate: for example, the one with the longest Wikipedia page, or the most viewed one.
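To make the popularity fallback concrete, here is a minimal sketch in Python. The `Candidate` class, the `most_popular` function and the popularity scores are all invented for illustration; a real system would take a measure such as page length or page views from the knowledge base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    description: str
    popularity: int  # stand-in for page length or page views; values below are invented


def most_popular(candidates):
    """Return the candidate with the highest popularity, or None if the list is empty."""
    return max(candidates, key=lambda c: c.popularity, default=None)


# Hypothetical candidates for the mention "Dallas".
# The scores are placeholders, not real statistics.
dallas = [
    Candidate("Dallas", "city in Texas, United States", 1000),
    Candidate("Dallas", "place in Georgia, United States", 40),
    Candidate("Dallas", "village in Moray, Scotland", 10),
]

best = most_popular(dallas)
print(best.description)
```

With no other clue, this baseline always answers the same way, whatever the caption is really about. That is exactly the weakness contextual clues are meant to address.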

The process of automatically linking words in a document (e.g. place names) to an external database (in this case, Wikipedia) goes by several names in the scientific literature: “Entity Linking”, “Named Entity Disambiguation”, “Reconciliation”, “Entity Resolution”, and so on. Although much progress has been made since the seminal works [2], the task is far from solved. Its difficulty varies with the quality and length of the text. A tweet or an SMS, for example, can be ambiguous even for a human. The same goes for short photo captions, as in the Cegesoma database. The temporal context also matters. When a picture from the 1940s says that the scene takes place in “Hamme, Belgium”, we must not forget that, at the time, at least three Belgian municipalities had this name, in three different provinces. Finally, named entity recognition often extracts several place names from the same text. How can we guess which one is the location where the picture was taken?

The method we are working on uses clues from the photo database: in this case, the thesaurus keywords that Cegesoma’s archivists have applied to folders containing groups of photos on the same theme. These keywords sometimes contain a place name, such as a city, a province, or a region. Using these terms, which we parsed out of the keywords, we will apply an algorithm [3] that queries a database called Wikidata, a kind of Wikipedia containing structured information readable by both humans and machines. For each location previously extracted from the photo captions, the algorithm will select the possible candidates and choose the most probable one based on the available clues. When several places are mentioned for the same picture, it will use the same clues to rule out the least likely ones. Finally, if the ambiguity is too strong, it will not try to guess further and will simply flag the text or photo for verification by a human.
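The selection step can be sketched roughly as follows. This is not the algorithm mentioned in [3], which is still in development; it is a simplified illustration under our own assumptions. The `Place` class, the `score` and `disambiguate` functions, the `min_margin` parameter, the identifiers and the province names are all hypothetical, and the candidates are hard-coded here instead of being fetched from Wikidata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    entity_id: str          # placeholder identifier, not a real Wikidata ID
    label: str
    located_in: frozenset   # names of enclosing areas (province, region, country)


def score(place, clues):
    """Count how many folder-level clues match the areas enclosing the place."""
    return len(place.located_in & clues)


def disambiguate(candidates, clues, min_margin=1):
    """Return the best candidate, or None when a human should verify.

    Abstains when there is no candidate, when no clue matches, or when
    the best score does not beat the runner-up by at least min_margin.
    """
    ranked = sorted(candidates, key=lambda p: score(p, clues), reverse=True)
    if not ranked:
        return None
    best = score(ranked[0], clues)
    runner_up = score(ranked[1], clues) if len(ranked) > 1 else 0
    if best == 0 or best - runner_up < min_margin:
        return None
    return ranked[0]


# Three hypothetical candidates for the caption mention "Hamme".
# "Province A/B/C" are placeholders, not the actual provinces.
hamme = [
    Place("X1", "Hamme", frozenset({"Province A", "Belgium"})),
    Place("X2", "Hamme", frozenset({"Province B", "Belgium"})),
    Place("X3", "Hamme", frozenset({"Province C", "Belgium"})),
]

# Keywords on the folder name a province: one candidate stands out.
print(disambiguate(hamme, {"Province B", "Belgium"}))

# Keywords only name the country: every candidate ties, so the function abstains.
print(disambiguate(hamme, {"Belgium"}))
```

Returning `None` instead of the top-ranked candidate is the design choice that matters here: a tie or a weak match is passed on for human verification, not guessed at.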

[1] This is the role of named entity recognition; see section ???
[2] Rao, D., McNamee, P., & Dredze, M. (2013). Entity linking: Finding extracted entities in a knowledge base. In Multi-source, multilingual information extraction and summarization (pp. 93-115). Springer, Berlin, Heidelberg.
[3] Written in Python. Still in development, but the code will soon be available online.

Published by Samnang Nop
