Authors: José Paredes, Maria Vanina Martinez, Gerardo Simari and Marcelo A. Falappa
Paper Information
Title: | Leveraging Probabilistic Existential Rules for Adversarial Deduplication |
Authors: | José Paredes, Maria Vanina Martinez, Gerardo Simari and Marcelo A. Falappa |
Proceedings: | PRUV 2018 Proceedings |
Editors: | Thomas Lukasiewicz, Rafael Peñaloza and Anni-Yasmin Turhan |
Keywords: | Deduplication, Entity Resolution, Cyber Security, Ontology Languages, Existential Rules |
Abstract: | The entity resolution problem in traditional databases, also known as deduplication, seeks to map multiple virtual objects to their corresponding set of real-world entities. Though the problem is challenging, it can be tackled in a variety of ways by leveraging several simplifying assumptions, such as the assumption that the multiple virtual objects arise from name or attribute ambiguity, clerical errors in data entry or formatting, missing or changing values, or abbreviations. In cyber security domains, however, the entity resolution problem takes on a very different form: malicious actors that operate in environments such as hacker forums and markets are highly motivated to remain semi-anonymous, because although they wish to keep their true identities secret from law enforcement, they also have a reputation to maintain with their customers. The above simplifying assumptions therefore cannot be made, and we coin the term "adversarial deduplication" to refer to this setting. In this paper, we propose the use of probabilistic existential rules (also known as Datalog+/-) to model knowledge engineering solutions to this problem; we show that tuple-generating dependencies can be used to generate probabilistic deduplication hypotheses, and that equality-generating dependencies can later be applied to leverage existing data towards grounding such hypotheses. The main advantage over existing deduplication tools is that our model operates under the open-world assumption, and is thus capable of modeling hypotheses over unknown objects, which can later become known if new data becomes available. |
Pages: | 15 |
Talk: | Jul 19 17:00 (Session 136D: PRUV regular papers) |