Relationship Queries on Extended Knowledge Graphs
Mohamed Yahya
1
, Denilson Barbosa
2
, Klaus Berberich
1
, Qiuyue Wang
3
, Gerhard Weikum
1
1
Max-Planck Institute for Informatics
2
University of Alberta
3
Renmin University of China
{myahya, kberberi, weikum}@mpi-inf.mpg.de
denilson@ualberta.ca qiuyuew@ruc.edu.cn
ABSTRACT
Entity search over text corpora is not geared for relation-
ship queries where answers are tuples of related entities and
where a query often requires joining cues from multiple doc-
uments. With large knowledge graphs, structured querying
on their relational facts is an alternative, but often suffers
from poor recall because of mismatches between user queries
and the knowledge graph or because of weakly populated re-
lations.
This paper presents the TriniT search engine for querying
and ranking on extended knowledge graphs that combine
relational facts with textual web contents. Our query lan-
guage is designed on the paradigm of SPO triple patterns,
but is more expressive, supporting textual phrases for each
of the SPO arguments. We present a model for automatic
query relaxation to compensate for mismatches between the
data and a user’s query. Query answers – tuples of entities
– are ranked by a statistical language model. We present
experiments with different benchmarks, including complex
relationship queries, over a combination of the Yago knowl-
edge graph and the entity-annotated ClueWeb’09 corpus.
Keywords
Relationship Queries; Extended Knowledge Graphs; Query
Relaxation
1. INTRODUCTION
1.1 Motivation and Problem
Searching for entities and associated properties has re-
ceived much attention for both web contents and enterprise
documents. Examples are queries for “musicians who con-
tributed to movie soundtracks” or “companies that acquired
Internet startups”. IR-centric approaches typically associate
entities with statistical language models in order to match
and rank entity mentions and surrounding phrases in text
corpora [3]. Semantic-Web-style approaches, on the other
hand, rather tap structured knowledge graphs (KGs) such
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
WSDM’16, February 22–25, 2016, San Francisco, CA, USA.
c
2015 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-3716-8/16/02. . . $15.00
DOI: http://dx.doi.org/10.1145/2835776.2835795
as Freebase or Linked Open Data (LOD) collections such as
combinations of DBpedia, Yago, and MusicBrainz, and use
SPARQL queries to retrieve relevant RDF triples [16].
Neither of these paradigms provides good support for re-
lationship queries that connect multiple entities in a specific
way and return tuples of connected entities. Consider, for
example, the task of finding songs that appear in movies
and returning a list of hsong, moviei pairs. This cannot be
fully expressed by IR-centric entity search which is bound to
return spurious results where a movie is merely mentioned in
the textual proximity of a song (e.g., “My Way” and the film
“The Man with the Golden Arm” in a Frank Sinatra biogra-
phy). On the other hand, the SPARQL language supports
expressive and precise queries over RDF graphs of entity-
relationship triples, but its results are limited by the prop-
erties (i.e., relation types, binary predicates) and facts (i.e.,
relation instances, predicate arguments) that the underlying
KG or LOD collection contains. So none of the established
paradigms can adequately cope with relationship queries.
Examples: In principle, the above example could be for-
mulated by the following SPARQL query:
SELECT ?s ?m WHERE {
?s type song . ?m type movie . ?s musicInFilm ?m }
where ?s and ?m are variables and the second line contains
three triple patterns over the subject-predicate-object (SPO)
triples of the underlying KG. The query should return bind-
ings for song-movie pairs that are in the desired relationship.
However, this works only if the KG does indeed offer the
predicate musicInFilm and that predicate is sufficiently well
populated. If the KG instead contained predicates filmHas-
Soundtrack of type movie × album and albumContainsSong
of type album × song, the user would need to formulate a
very different query and non-expert users would typically
fail to get this right. Even if the user succeeded in posing
the best query formulation, the answers would be limited
by the facts of the KG, while the web or social media could
potentially hold many additional answers.
Similar cases arise in domains like business or sports, for
example, when searching for “South-American football play-
ers and European championships that they have won” (with
answers such as Lionel Messi and the UEFA Champions
League). Here one challenge that users would struggle with
is to properly formulate the two-hop join query with triple
patterns ?p type player . ?p playedFor ?t . ?t won ?c .
?t type footballClub as KGs associate championships with
teams, not players. IR-style entity search would be more
convenient for users, but loses precision and would be mis-