Knowledge Graph Query Processing and Benchmarking
Funded by NSF IIS-1528175.
This material is based upon work supported by the National
Science Foundation under Grant No. 1528175. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science
Foundation.
Project Summary
Today, when a user has a question, search engines such as Google or Bing still
require her to read through multiple web pages to find answers. This paradigm
is changing with the rise of mobile devices. Over the last decade, many
systems have emerged that aim to answer queries directly, e.g., using knowledge
graphs collected from the Internet or through crowdsourcing.
A real sea change in information search
is coming! A broad range of new
applications is emerging in intelligent policing, personal assistance,
individualized healthcare, legal services, scientific literature search, and,
recently, robotics. This project has the potential to make
fundamental advances in querying heterogeneous knowledge graphs, which are
ubiquitous. It will open up a set of
new knowledge base applications in fast-growing areas such as social networks,
intelligence analysis, and medical research,
and it will significantly ease query formulation and improve search
quality and speed in these applications.
Given the high data heterogeneity in
knowledge graphs, writing structured queries that fully comply with the data
specification is extremely hard for ordinary users, while keyword queries can be
too ambiguous to reflect user search intent. The situation becomes even worse
when the same entity or relation has multiple representations.
A sophisticated query system should support different concept representations
without forcing users onto a tightly controlled vocabulary. It should provide
simple mechanisms that let users quickly arrive at the right query, either
explicitly or implicitly (e.g., via relevance feedback).
This project will develop such a system and make it user-friendly and
scalable.
The proposed research includes a plan
to build a flexible query benchmark that can cope with heterogeneous,
large-scale knowledge graphs, as well as user-specified configurations and
performance metrics. Benchmarks are
indispensable for the rapid development of database research.
There are many successful examples of robust and meaningful benchmarks
greatly expediting the development of a research area. The query
benchmark proposed in this project is urgently needed.
It will (1) provide a standardized
way to fairly and comprehensively evaluate different knowledge graph query
algorithms, (2) improve the understanding of existing query engines, and (3)
advance the area by bringing researchers onto the same playground to
build better, faster, and more intelligent methods.
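As a rough illustration of the intended flexibility, the sketch below shows a minimal benchmark harness that runs queries against a pluggable engine and computes user-selected performance metrics. All names here (run_benchmark, the toy knowledge graph, the metric names) are hypothetical assumptions for illustration, not part of any released benchmark.

```python
import time

def run_benchmark(queries, engine, metrics=("latency", "recall")):
    """Hypothetical harness: run each query against a pluggable engine
    (any callable) and collect the user-specified metrics."""
    results = []
    for q in queries:
        start = time.perf_counter()
        answers = engine(q["query"])
        elapsed = time.perf_counter() - start
        record = {"query": q["query"]}
        if "latency" in metrics:
            record["latency"] = elapsed          # wall-clock seconds per query
        if "recall" in metrics:
            gold = set(q["gold"])                # ground-truth answers
            record["recall"] = len(gold & set(answers)) / len(gold) if gold else 1.0
        results.append(record)
    return results

# Toy "engine" over a tiny hand-built knowledge graph (illustrative only).
KG = {
    "born_in(Obama)": ["Honolulu"],
    "children(Ned_Stark)": ["Robb", "Sansa", "Arya", "Bran", "Rickon"],
}
toy_engine = lambda q: KG.get(q, [])

report = run_benchmark(
    [{"query": "born_in(Obama)", "gold": ["Honolulu"]}],
    toy_engine,
)
```

A real harness would additionally load a standardized query workload and aggregate metrics across runs; the point here is only that the engine and the metric set are both configurable.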
Benchmark Dataset Release
1. GraphQuestions: A Characteristic-rich Question Answering Dataset [github] [pdf]
Natural language question answering (QA), i.e.,
finding direct answers to natural language questions, is under active
development. Questions in real life often exhibit rich characteristics,
constituting dimensions along which question difficulty varies. The aim of this
project is to explore how to construct a characteristic-rich QA dataset in a
systematic way, and to provide the community with a dataset whose question
characteristics are rich and explicitly specified. Such a dataset enables
fine-grained evaluation of QA systems: developers can know exactly what
kinds of questions their systems fail on, and improve accordingly.
We
present GraphQuestions, a QA dataset consisting of a set of factoid questions
with logical forms and ground-truth answers. The current release (v1.0) of the
dataset contains 5,166 questions, which are constructed based on Freebase, a
large-scale knowledge base. An array of question characteristics is formalized,
and every question has an explicit specification of its characteristics:
- Structure Complexity: the number of relations involved in a question
- Function: additional functions like counting or superlatives, e.g., "How many
children of Ned Stark were born in Winterfell?"
- Commonness: how common a question is, e.g., "Where was Obama born?" is more
common than "What is the tilt of axis of Polestar?"
- Paraphrasing: different natural language expressions of the same question
- Answer Cardinality: the number of answers to a question
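To make these characteristics concrete, here is a minimal sketch of what a GraphQuestions-style record could look like. The field names, the logical-form syntax, and the example values are illustrative assumptions, not the dataset's actual schema or annotation format.

```python
import json

# Hypothetical GraphQuestions-style record; field names and values
# are assumptions for illustration, not the released schema.
record = {
    "question": "How many children of Ned Stark were born in Winterfell?",
    "logical_form": "count(children(Ned_Stark) AND born_in(Winterfell))",  # illustrative, not real Freebase MQL
    "answer": ["3"],
    "structure_complexity": 2,   # two relations involved: children, born_in
    "function": "count",         # counting / superlative / comparative / none
    "commonness": -25.7,         # e.g., a log-scale popularity score (assumed)
    "paraphrase_id": 17,         # groups paraphrases of the same question
    "answer_cardinality": 1,     # the count question has a single answer
}
print(json.dumps(record, indent=2))
```

With characteristics attached per question like this, a developer can slice evaluation results by any dimension, e.g., accuracy on count questions versus accuracy on single-relation questions.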
2. GloREPlus: Global Textual Relation Embedding for Relational
Understanding [github] [pdf]
Data and code release for the ACL 2019 paper "Global Textual
Relation Embedding for Relational Understanding"
Publications
-
Global Textual Relation Embedding for Relational Understanding,
by Z.
Chen, H. Zha, H. Liu, W. Chen, X. Yan and Y. Su,
ACL'19
(Proc. of the Annual Meeting of the Association for Computational
Linguistics) [pdf]
-
What It Takes to Achieve 100% Condition Accuracy on WikiSQL,
by S. Yavuz, I. Gur, Y. Su, X. Yan,
EMNLP'18
(Proc. of the 2018 Conference on Empirical Methods in Natural
Language Processing) [pdf]
- DialSQL: Dialogue Based Structured Query Generation,
by I. Gur, S.
Yavuz, Y. Su, X. Yan,
ACL'18
(Proc. of the Annual Meeting of the Association for Computational
Linguistics, 2018) [pdf]
- Variational Knowledge
Graph Reasoning,
by W. Chen, W. Xiong, X. Yan and
W. Wang,
NAACL-HLT'18 (Proc. of
the 16th North American Chapter of ACL: Human Language Technologies, 2018)
[pdf]
- Global Relation Embedding for Relation Extraction,
by Yu Su*, Honglei Liu*, Semih Yavuz, Izzeddin Gur, Huan Sun, Xifeng Yan
(*: Equal Contribution),
NAACL-HLT'18 (Proc. of
the 16th North American Chapter of ACL: Human Language Technologies, 2018)
[pdf] [code]
(arXiv preprint https://arxiv.org/abs/1704.05958, April 2017)
- Scalable Construction and Querying of Massive Knowledge Bases
(Tutorial),
by X. Ren, Y. Su, P. Szekely, X. Yan.
WWW'18 (Proc. of the International Conference
on World Wide Web), 2018 [website][slides1][slides2][slides3]
- Construction and Querying of Large-scale Knowledge Bases (Tutorial),
by X. Ren, Y. Su, X. Yan.
CIKM'17 (Proc. of the
ACM International Conference on Information and Knowledge Management), 2017
[website][slides]
- Cross-domain Semantic Parsing via Paraphrasing,
by Y.
Su, X. Yan.
EMNLP'17 (Proc. of the 2017 Conf. on Empirical
Methods in Natural Language Processing), 2017 [pdf]
- Recovering Question Answering Errors via Query Revision,
by S. Yavuz, I.
Gur, Y. Su, X. Yan.
EMNLP'17 (Proc.
of the 2017 Conference on Empirical Methods in Natural Language Processing),
2017 [pdf]
- On Generating Characteristic-rich Question Sets for QA Evaluation,
by Y.
Su, H. Sun, B. Sadler, M. Srivatsa, I. Gur, Z. Yan, and X. Yan,
EMNLP'16
(Proc. of the
2016 Conf. on Empirical Methods in Natural Language Processing), 2016 [pdf]
- Improving Semantic Parsing via Answer Type Inference,
by S. Yavuz, I. Gur,
Y. Su, M. Srivatsa, X. Yan,
EMNLP'16
(Proc. of the 2016 Conf. on Empirical Methods in
Natural Language Processing), 2016 [pdf]
- Semantic SPARQL Similarity Search Over RDF Knowledge Graphs,
by W. Zheng, L.
Zou, W. Peng, X. Yan, S. Song, D. Zhao,
VLDB'16
(Proc. of the 42nd International
Conference on Very Large Data Bases), 2016. [pdf]
- Exploiting Relevance Feedback in Knowledge Graph Search,
by Y. Su, S. Yang,
H. Sun, M. Srivatsa, S. Kase, M. Vanni and X. Yan,
KDD'15
(Proc. of Int. Conf. on Knowledge Discovery and Data Mining),
2015 [pdf]
- SLQ: A User-friendly Graph Querying System,
by S. Yang, Y. Xie, Y.
Wu, T. Wu, H. Sun, J. Wu, X. Yan,
SIGMOD'14
(Proc. 2014 Int. Conf. on Management of Data) (demo paper), 2014. [pdf]
[demo]
- Schemaless and Structureless Graph Querying,
by S. Yang, Y. Wu, H. Sun,
X. Yan,
VLDB'14
(Proc. of the 40th Int. Conf. on Very Large Databases),
2014. [pdf]
Dissertations
1. "Towards Democratizing Data Science with Natural
Language Interfaces,"
Yu Su, 2018 [pdf]