Data Management Lab, Wangxuan Institute of Computer Technology, Peking University

Teaching & Research

Research Overview

Large Scale Graph Data Management

A traditional relational DBMS (RDBMS) requires a pre-defined schema, so RDBMSs are often called schema-first systems. However, some emerging applications have no explicit or fixed schema, which makes it challenging to predefine one. To address this issue, many NoSQL systems, such as key-value (KV) stores, are widely used in big data management. However, KV stores ignore the connections between entities. Graph data management instead uses a graph to model these complex connections. In this project, we focus on designing structure-aware indexes and query optimization strategies for different kinds of graph queries, such as subgraph matching queries and path queries.
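As a small illustration of the path queries mentioned above, the following sketch answers a shortest-path query with a breadth-first search over an in-memory adjacency list. It is only a baseline under simple assumptions: the graph fits in memory, edges are unweighted and directed, and no index is used. The names `Graph` and `shortest_path` and the toy data are hypothetical and not taken from any of the systems described here.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical representation: each vertex maps to its out-neighbours.
Graph = Dict[str, List[str]]


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Return one shortest directed path from source to target, or None."""
    if source == target:
        return [source]
    # parent[v] records the vertex from which v was first reached.
    parent: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt in parent:
                continue  # already reached by a path that is no longer
            parent[nxt] = node
            if nxt == target:
                # Walk the parent pointers back to the source.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Toy data, for illustration only.
toy_graph: Graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
path_a_to_e = shortest_path(toy_graph, "a", "e")
path_e_to_a = shortest_path(toy_graph, "e", "a")
```

A key-value store can return the neighbours of one vertex with a single lookup, but a traversal like this one needs many dependent lookups, which is one reason structure-aware indexes matter for graph queries.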

Representative Work:

[1] Liang Hong, Lei Zou, Xiang Lian, Philip S. Yu: Subgraph Matching with Set Similarity in a Large Graph Database. IEEE Trans. Knowl. Data Eng. (TKDE) 27(4): 964-978 (2015)

[2] Weiguo Zheng, Lei Zou, Xiang Lian, Dong Wang, Dongyan Zhao: Efficient Graph Similarity Search Over Large Graph Databases. IEEE Trans. Knowl. Data Eng. (TKDE) 27(9): 2507-2521 (2015)

[3] Weiguo Zheng, Lei Zou, Yansong Feng, Lei Chen, Dongyan Zhao: Efficient SimRank-based Similarity Join Over Large Graphs. Proceedings of the VLDB Endowment (PVLDB) 6(7): 493-504 (2013)

[4] Lei Zou, Lei Chen, M. Tamer Özsu, Dongyan Zhao: Answering pattern match queries in large graph databases via graph embedding. VLDB Journal (VLDB J). 21(1): 97-120 (2012)

[5] Lei Zou, Lei Chen: Pareto-Based Dominant Graph: An Efficient Indexing Structure to Answer Top-K Queries. IEEE Trans. Knowl. Data Eng. (TKDE) 23(5): 727-741 (2011)

Graph-based RDF Data Management

The increasing size of RDF data requires efficient systems to store and query them. There have been efforts to map RDF data to a relational representation, and a number of systems exist that follow this approach. We have been investigating an alternative approach of maintaining the native graph model to represent RDF data, and utilizing graph database techniques (such as a structure-aware index and a graph matching algorithm) to address RDF data management. More specifically, we focus on the following two aspects:

1. gStore: RDF Storage and Query Engine

We design a graph-based RDF data management system (commonly called a "triple store") that maintains the graph structure of the original RDF data. Its data model is a labeled, directed multi-edge graph (called an RDF graph), where each vertex corresponds to a subject or an object. We also represent a given SPARQL query as a query graph Q. Query processing then amounts to finding subgraph matches of Q over the RDF graph G. gStore incorporates an index with a number of associated pruning techniques to speed up subgraph matching over the RDF graph. gStore is now an open-source project on GitHub under the BSD license. The centralized version can support more than four billion triples on a single machine. Furthermore, we also study several distributed RDF systems to address the scalability of the gStore system.

2. Natural Language Question/Answering over Knowledge Graphs

The complexity of the SPARQL syntax and the lack of a schema make it hard for end users to use SPARQL. Providing end users with an easy-to-use interface for accessing RDF datasets effectively has been recognised as an important concern. In this project, we study how to answer users' natural language questions and keyword queries over RDF knowledge graphs. Specifically, we design the gAnswer system, which transforms users' natural language questions and keywords into SPARQL queries. We then employ gStore to evaluate the translated SPARQL queries and return the answers to end users. Furthermore, we also study how to design a self-correction mechanism that learns from users' feedback so that the system returns more precise answers.
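The subgraph-matching view of SPARQL described under item 1 can be sketched in a few lines. The example below is not gStore's algorithm: it is a naive backtracking matcher with no index and no pruning, and it assumes that a query is a list of triple patterns in which terms starting with "?" are variables. The function names (`is_var`, `unify`, `match_patterns`) and the toy triples are hypothetical. Different variables may bind to the same vertex, as in standard SPARQL basic graph pattern semantics.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # variable name -> RDF term


def is_var(term: str) -> bool:
    """Assumed convention: variables are written with a leading '?'."""
    return term.startswith("?")


def unify(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend binding so that pattern equals triple, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if extended.get(p, t) != t:
                return None  # variable already bound to a different term
            extended[p] = t
        elif p != t:
            return None      # constant in the pattern does not match
    return extended


def match_patterns(
    patterns: List[Triple],
    graph: List[Triple],
    binding: Optional[Binding] = None,
) -> Iterator[Binding]:
    """Yield every binding that maps all patterns onto triples of graph."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:  # full scan: this is what an index would avoid
        extended = unify(first, triple, binding)
        if extended is not None:
            yield from match_patterns(rest, graph, extended)


# Toy RDF graph G and query graph Q, for illustration only.
G: List[Triple] = [
    ("alice", "authorOf", "paper1"),
    ("bob", "authorOf", "paper1"),
    ("paper1", "publishedIn", "venueX"),
    ("alice", "worksAt", "labY"),
]
Q: List[Triple] = [
    ("?a", "authorOf", "?p"),
    ("?p", "publishedIn", "venueX"),
    ("?a", "worksAt", "labY"),
]
results = list(match_patterns(Q, G))
```

Here `results` holds a single binding, with `?a` bound to `alice` and `?p` bound to `paper1`. The scan over every triple for every pattern is the cost that an index with pruning is meant to reduce on large RDF graphs.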

Representative Work:

[1] Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. Proceedings of the VLDB Endowment (PVLDB) 4(8): 482-493 (2011)

[2] Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, Dongyan Zhao: Natural language question answering over RDF: a graph data driven approach. SIGMOD Conference 2014: 313-324

[3] Peng Peng, Lei Zou, M. Tamer Özsu, Lei Chen, Dongyan Zhao: Processing SPARQL queries over distributed RDF graphs. VLDB J. 25(2): 243-268 (2016)

[4] Peng Peng, Lei Zou, M. Tamer Özsu, Dongyan Zhao: Multi-query Optimization in Federated RDF Systems. DASFAA (1) 2018: 745-765 (BEST PAPER Award)

[5] Sen Hu, Lei Zou, Xinbo Zhang: A State-transition Framework to Answer Complex Questions over Knowledge Base. EMNLP 2018

Dynamic and Streaming Graph Data Management and Analysis

Due to their high indexing overhead, traditional index-based graph data management techniques do not work well on high-speed dynamic or streaming graph data. Therefore, in this project, we study the following issues: how to design a uniform data structure that supports diverse queries and analyses over dynamic/streaming graphs; how to design high-throughput parallel dynamic/streaming graph systems; and how to design effective graph-oriented probabilistic data structures, such as graph sketches.
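To give a flavour of what a probabilistic summary of a graph stream can look like, the sketch below applies a generic Count-Min sketch to edge weights. This is a swapped-in textbook technique and not the summarization structure from the work listed below: it keeps no graph topology and only estimates the accumulated weight of a given edge. The class name `EdgeCountMinSketch`, its parameters, and the toy stream are hypothetical.

```python
import hashlib
from typing import List


class EdgeCountMinSketch:
    """Count-Min sketch keyed by directed edges (src, dst).

    Assumes non-negative weights, so estimates never fall below
    the true accumulated weight of an edge.
    """

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width    # counters per row
        self.depth = depth    # number of independent rows
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _bucket(self, row: int, src: str, dst: str) -> int:
        # Deterministic per-row hash: the row number is part of the key.
        key = repr((row, src, dst)).encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, src: str, dst: str, weight: int = 1) -> None:
        """Process one stream item: add weight to edge (src, dst)."""
        for row in range(self.depth):
            self.table[row][self._bucket(row, src, dst)] += weight

    def estimate(self, src: str, dst: str) -> int:
        """Return an upper-biased estimate of the edge's total weight."""
        return min(
            self.table[row][self._bucket(row, src, dst)]
            for row in range(self.depth)
        )


# Toy edge stream of (src, dst, weight) items, for illustration only.
stream = [("a", "b", 2), ("b", "c", 1), ("a", "b", 3), ("c", "a", 5)]
sketch = EdgeCountMinSketch(width=64, depth=4)
for src, dst, weight in stream:
    sketch.update(src, dst, weight)
estimate_ab = sketch.estimate("a", "b")
```

Memory use is fixed at `width * depth` counters regardless of stream length, at the price of possible overestimation when distinct edges collide in every row. Supporting topology-aware queries such as reachability requires a structure that also preserves some of the graph's connectivity.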

Representative Work:

[1] Youhuan Li, Lei Zou, M. Tamer Özsu, Dongyan Zhao: Time Constrained Continuous Subgraph Search over Streaming Graphs. ICDE 2019

[2] Xiangyang Gou, Lei Zou, Chenxingyu Zhao, Tong Yang: Fast and Accurate Graph Stream Summarization. ICDE 2019