FLOC 2018: FEDERATED LOGIC CONFERENCE 2018
Leveraging Probabilistic Existential Rules for Adversarial Deduplication

Authors: José Paredes, Maria Vanina Martinez, Gerardo Simari and Marcelo A. Falappa

Paper Information

Title: Leveraging Probabilistic Existential Rules for Adversarial Deduplication
Authors: José Paredes, Maria Vanina Martinez, Gerardo Simari and Marcelo A. Falappa
Proceedings: PRUV 2018 Proceedings
Editors: Thomas Lukasiewicz, Rafael Peñaloza and Anni-Yasmin Turhan
Keywords: Deduplication, Entity Resolution, Cyber Security, Ontology Languages, Existential Rules
Abstract:

The entity resolution problem in traditional databases, also known as deduplication, seeks to map multiple virtual objects to their corresponding set of real-world entities. Though the problem is challenging, it can be tackled in a variety of ways by leveraging several simplifying assumptions, such as the assumption that multiple virtual objects arise from name or attribute ambiguity, clerical errors in data entry or formatting, missing or changing values, or abbreviations. However, in cyber security domains the entity resolution problem takes on a whole different form, since malicious actors that operate in environments such as hacker forums and markets are highly motivated to remain semi-anonymous: though they wish to keep their true identities secret from law enforcement, they also have a reputation to keep with their customers. The above simplifying assumptions cannot be made in this setting, and we therefore coin the term "adversarial deduplication" to refer to it. In this paper, we propose the use of probabilistic existential rules (also known as Datalog+/-) to model knowledge engineering solutions to this problem; we show that tuple-generating dependencies can be used to generate probabilistic deduplication hypotheses, and that equality-generating dependencies can later be applied to leverage existing data towards grounding such hypotheses. The main advantage over existing deduplication tools is that our model operates under the open-world assumption, and is thus capable of modeling hypotheses over unknown objects, which can later become known if new data becomes available.
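
As a rough illustration of the two rule types named in the abstract, the following toy sketch shows a rule in the style of a tuple-generating dependency creating a hypothesis over an unknown entity (a labelled null), and a rule in the style of an equality-generating dependency grounding that null once data about a known operator is available. It is not taken from the paper: the predicates, usernames, the fixed probability 0.7, and the function names are all hypothetical, and the probability is carried along as a plain annotation rather than under any particular probabilistic semantics.

```python
from itertools import combinations, count

# Hypothetical toy facts: (username, stylometric fingerprint).
posts = [("dark_wolf", "styleA"), ("n1ghtowl", "styleA"), ("trader99", "styleB")]


def apply_tgd(posts, prob=0.7):
    """TGD-style rule (assumed for illustration):
    style(U1, S) & style(U2, S) -> exists Z. operated_by(U1, Z) & operated_by(U2, Z)
    Each firing invents a fresh labelled null Z standing for an unknown entity.
    """
    nulls = count(1)
    hypotheses = []
    for (u1, s1), (u2, s2) in combinations(posts, 2):
        if s1 == s2:
            z = f"_:z{next(nulls)}"  # labelled null: entity not yet known
            hypotheses.append({"users": (u1, u2), "entity": z, "prob": prob})
    return hypotheses


def apply_egd(hypotheses, known_operator):
    """EGD-style rule (assumed for illustration):
    operated_by(U, Z) & known_operator(U, P) -> Z = P
    Replaces the null with a known entity when one of the users is identified.
    Simplification: conflicting identifications are not detected here.
    """
    grounded = []
    for h in hypotheses:
        entity = h["entity"]
        for user in h["users"]:
            if user in known_operator:
                entity = known_operator[user]
        grounded.append({**h, "entity": entity})
    return grounded


hyps = apply_tgd(posts)
# New data arrives later: one username is tied to a known (hypothetical) person.
grounded = apply_egd(hyps, {"n1ghtowl": "person_42"})
```

In this sketch the hypothesis first refers to an unknown entity and only later becomes a statement about a known one, which mirrors the open-world behaviour the abstract describes.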

Pages: 15
Talk: Jul 19 17:00 (Session 136D: PRUV regular papers)