Publications
Publications in databases, vector search, graph systems, and Data+AI infrastructure.
^ indicates first author or corresponding author.
2026
- [ICDE] SQLVec: SQL-Based Vector Similarity Search. Zequn Zhang, Yuanyuan Zhu, Hao Zhang, and Jeffrey Xu Yu. In 42nd IEEE International Conference on Data Engineering (ICDE), 2026. (To appear)
Brings vector similarity search into SQL, making nearest-neighbor style retrieval a first-class capability inside relational data systems.
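As a rough mental model (the names below are illustrative, not the paper's API), making vector search "first-class in SQL" means a `ORDER BY distance(embedding, :query) LIMIT k` query works like any other relational operator. A minimal Python sketch of that relational view:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k(rows, query_vec, k):
    """Relational view of vector search: ORDER BY distance LIMIT k.
    rows is a list of (row_id, embedding) pairs."""
    scored = [(cosine_distance(vec, query_vec), rid) for rid, vec in rows]
    scored.sort()
    return [rid for _, rid in scored[:k]]

rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.9, 0.1])]
print(top_k(rows, [1.0, 0.0], 2))  # the two rows most similar to the query
```

Real engines replace this brute-force scan with approximate indexes; the sketch only shows the query semantics.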
@inproceedings{ICDE-26,
  title = {SQLVec: SQL-Based Vector Similarity Search},
  author = {Zhang, Zequn and Zhu, Yuanyuan and Zhang, Hao and Yu, Jeffrey Xu},
  year = {2026},
  booktitle = {42nd IEEE International Conference on Data Engineering (ICDE)},
  note = {(To appear)},
}
- [SIGMOD] Accelerating Triangle-Connected Truss Community Search Across Heterogeneous Hardware. Junchao Ma, Xin Yan, Yuanyuan Zhu, Guojing Li, Hao Zhang, and Jeffrey Xu Yu. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2026. (To appear)
Accelerates triangle-connected truss community search across heterogeneous hardware, combining graph algorithms with hardware-aware execution.
@inproceedings{SIGMOD-26-1,
  title = {Accelerating Triangle-Connected Truss Community Search Across Heterogeneous Hardware},
  author = {Ma, Junchao and Yan, Xin and Zhu, Yuanyuan and Li, Guojing and Zhang, Hao and Yu, Jeffrey Xu},
  year = {2026},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  note = {(To appear)},
}
- [SIGMOD] TQEX(SQL): Tensor-based Query Engine Enhanced by Bridging the Gap. Haitao Zhang, Ran Pang, Yuanyuan Zhu, Hao Zhang^, Congli Gao, Ming Zhong, Jiawei Jiang, Tieyun Qian, and Jeffrey Xu Yu. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2026.
With the development of AI and the growing demand for computational power, hardware is becoming increasingly specialized and heterogeneous. The emergence of diverse specialized hardware architectures, each with distinct characteristics and programming abstractions, poses significant portability and sustainability challenges for existing data processing systems. Tensor Computation Runtimes (TCRs) abstract away low-level hardware complexities by providing a hardware-independent tensor-based interface, enabling data scientists to effectively leverage the powerful capabilities of new hardware accelerators (collectively referred to as XPUs). Built on TCRs, the existing relational query engine TQP demonstrates portability across a wide range of target hardware and sustainability through the ongoing evolution of TCRs and hardware. However, it neglects the large gap between irregular SQL workloads and uniform tensor operations when mapping SQL operators to tensor programs, which causes significant storage and computation overhead. In this paper, for the first time, we analyze the underlying gap between SQL and tensors and provide guidelines to bridge it. Following these guidelines, we build a new enhanced tensor-based query engine (TQEx) that bridges the gap from multiple directions: we develop efficient storage and computation strategies for variable-length data, and design efficient tensor-based SQL operators such as join and aggregate. We also extend TQEx to multiple XPUs for large-scale data processing. Extensive experimental studies show that TQEx achieves a 9.6x speedup (with a peak of 41.9x) over TQP on TPC-H, and is 27.9x faster than leading GPU databases such as HeavyDB. On TPC-H at scale factor 100, TQEx outperforms DuckDB by 12.2x and HeavyDB by 22.7x on supported queries.
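To see the "gap" concretely (an illustrative sketch, not TQEx code): a join mapped naively onto uniform tensor operations becomes a dense all-pairs broadcast comparison, paying O(|R| x |S|) storage and compute that a hash join avoids — exactly the kind of overhead the guidelines target.

```python
def tensor_style_join(keys_r, keys_s):
    """Emulate a broadcast-compare join, the naive tensor mapping of
    R JOIN S ON R.key = S.key: build the full comparison mask
    mask[i][j] = (R.key[i] == S.key[j]), then gather matching pairs.
    A TCR would run the mask as one fused kernel, but it is still dense."""
    mask = [[r == s for s in keys_s] for r in keys_r]
    return [(i, j) for i, row in enumerate(mask)
                   for j, hit in enumerate(row) if hit]

print(tensor_style_join([1, 2, 3], [2, 3, 3]))
```

A classic hash join touches each tuple once; the dense mask above touches every (i, j) pair, which is why irregular SQL workloads need specially designed tensor operators.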
@inproceedings{SIGMOD-26-2,
  title = {TQEX(SQL): Tensor-based Query Engine Enhanced by Bridging the Gap},
  author = {Zhang, Haitao and Pang, Ran and Zhu, Yuanyuan and Zhang, Hao and Gao, Congli and Zhong, Ming and Jiang, Jiawei and Qian, Tieyun and Yu, Jeffrey Xu},
  year = {2026},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3769835},
}
- [VLDB] Aquila: A High-Concurrency System for Incremental Graph Query. Ziqi Zhou, Hao Zhang^, Jiaxin Yao, Kangfei Zhao, Zhiwei Zhang, Sen Gao, Jingpeng Hao, Ye Yuan, and Guoren Wang. In 52nd International Conference on Very Large Data Bases (VLDB), 2026. (To appear)
Explores high-concurrency execution for incremental graph queries, targeting fast updates and responsive graph analysis at scale.
@inproceedings{VLDB-26,
  title = {Aquila: A High-Concurrency System for Incremental Graph Query},
  author = {Zhou, Ziqi and Zhang, Hao and Yao, Jiaxin and Zhao, Kangfei and Zhang, Zhiwei and Gao, Sen and Hao, Jingpeng and Yuan, Ye and Wang, Guoren},
  year = {2026},
  booktitle = {52nd International Conference on Very Large Data Bases (VLDB)},
  note = {(To appear)},
}
- [arXiv] SEMA: A Unified Agentic System for Multimodal Data Workflows. ArXiv preprint, 2026.
SEMA is a unified agentic system that supports multimodal retrieval, semantic ETL, and analytics over structured and unstructured data in one end-to-end workflow.
@misc{Arxiv-26,
  title = {SEMA: A Unified Agentic System for Multimodal Data Workflows},
  author = {},
  year = {2026},
  note = {ArXiv preprint},
  archiveprefix = {arXiv},
  primaryclass = {cs.DB},
}
2025
- [DASFAA] Breaking Free from Label Limitations: A Novel Unsupervised Attack Method for Graph Classification. Yadong Wang, Zhiwei Zhang, Pengpeng Qiao, Ye Yuan, Hao Zhang, and Guoren Wang. In 30th International Conference on Database Systems for Advanced Applications (DASFAA), 2025.
Graph Neural Networks (GNNs) have proven highly effective for graph classification across diverse fields. However, despite their success, GNNs remain vulnerable to adversarial attacks that can significantly degrade their classification accuracy. Existing adversarial attack strategies mainly focus on supervised approaches, limiting their applicability in scenarios where the label information is scarce or unavailable. This paper introduces an innovative unsupervised attack method for graph classification that operates without relying on label information. Specifically, our method first leverages a graph contrastive learning loss to learn robust graph embeddings by comparing different stochastic augmented views of the graphs. To effectively perturb the graphs, we introduce an implicit estimator and flip edges with the top-k highest scores, determined by the estimator, to maximize the degradation of the model’s performance. Experiments on multiple datasets show the effectiveness of our proposed unsupervised attack strategy in degrading the performance of various graph classification models.
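The perturbation step can be sketched in a few lines (a simplification: `score` stands in for the paper's learned implicit estimator, and only edge deletions are shown rather than flips in both directions):

```python
def attack_top_k(edges, score, k):
    """Perturb a graph by flipping (here: removing) the k edges the
    estimator ranks as most damaging to the downstream classifier."""
    ranked = sorted(edges, key=score, reverse=True)
    to_flip = set(ranked[:k])
    return [e for e in edges if e not in to_flip]

# toy stand-in estimator: rank edges by the sum of their endpoint ids
edges = [(0, 1), (1, 2), (2, 3)]
print(attack_top_k(edges, score=lambda e: e[0] + e[1], k=1))
```

The paper's contribution is learning a useful `score` without labels, via a graph contrastive loss over augmented views; the ranking-and-flip step itself is this simple.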
@inproceedings{DASFAA-25,
  title = {Breaking Free from Label Limitations: A Novel Unsupervised Attack Method for Graph Classification},
  author = {Wang, Yadong and Zhang, Zhiwei and Qiao, Pengpeng and Yuan, Ye and Zhang, Hao and Wang, Guoren},
  year = {2025},
  booktitle = {30th International Conference on Database Systems for Advanced Applications (DASFAA)},
  doi = {10.1007/978-981-95-4155-3_40},
}
- [SIGMOD] GES: High-Performance Graph Processing Engine and Service in Huawei. Sen Gao, Jianwen Zhao, Hao Zhang^, Shixuan Sun, Chen Liang, Gongye Chen, et al. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2025.
This paper presents the Graph Engine Service (GES), an advanced graph database management system characterized by its composable architecture and highly factorized query executor. This design enables GES to handle a substantial volume of concurrent queries with remarkable efficiency. GES currently holds the top position in the authoritative LDBC-SNB-DECLARATIVE benchmark rankings, exceeding the throughput of the second-place system by over three orders of magnitude and providing its customers with exceptional cost-effectiveness and user experience.
@inproceedings{SIGMOD-25-1,
  title = {GES: High-Performance Graph Processing Engine and Service in Huawei},
  author = {Gao, Sen and Zhao, Jianwen and Zhang, Hao and Sun, Shixuan and Liang, Chen and Chen, Gongye and others},
  year = {2025},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3722212.3724439},
}
- [SIGMOD] TGraph: A Tensor-centric Graph Processing Framework. Yongliang Zhang, Yuanyuan Zhu, Hao Zhang^, Congli Gao, Yuyang Wang, et al. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2025.
Graphs are ubiquitous in real-world applications, and many graph processing systems have been developed. Recently, hardware accelerators have been exploited to speed up graph systems. However, such hardware-specific systems are hard to migrate across different hardware backends. In this paper, we propose the first tensor-based graph processing framework, TGraph, which can be smoothly deployed and run on any powerful hardware accelerator (uniformly called XPU) that supports Tensor Computation Runtimes (TCRs). TCRs, which are deep learning frameworks along with their runtimes and compilers, provide tensor-based interfaces so that users can easily utilize specialized hardware accelerators without delving into complex low-level programming details. However, building an efficient tensor-based graph processing framework is non-trivial. Thus, we make the following efforts: (1) propose a tensor-centric computation model for users to implement graph algorithms with easy-to-use programming interfaces; (2) provide a set of graph operators implemented with tensors to shield the computation model from the detailed tensor operators, so that TGraph can be easily migrated and deployed across different TCRs; (3) design a tensor-based graph compression and computation strategy and an out-of-XPU-memory computation strategy to handle large graphs. We conduct extensive experiments on multiple graph algorithms (BFS, WCC, SSSP, etc.), which validate that TGraph not only outperforms seven state-of-the-art graph systems, but also can be smoothly deployed and run on multiple DL frameworks (PyTorch and TensorFlow) and hardware backends (Nvidia GPU, AMD GPU, and Apple MPS).
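The tensor-centric formulation can be illustrated with BFS, where one traversal step is a boolean matrix-vector product of the frontier against the adjacency matrix. This plain-Python sketch only shows the idea; a TCR would execute the same pattern as fused tensor kernels on an XPU:

```python
def bfs_levels(adj, source):
    """BFS expressed tensor-style: the frontier is a boolean vector, and one
    step computes next[j] = OR_i frontier[i] AND adj[i][j], masked by visited.
    adj is a dense 0/1 adjacency matrix; returns {vertex: level}."""
    n = len(adj)
    visited = [False] * n
    frontier = [i == source for i in range(n)]
    levels = {source: 0}
    level = 0
    while any(frontier):
        for i, f in enumerate(frontier):
            if f:
                visited[i] = True
        level += 1
        nxt = [any(frontier[i] and adj[i][j] for i in range(n)) and not visited[j]
               for j in range(n)]
        for j, f in enumerate(nxt):
            if f and j not in levels:
                levels[j] = level
        frontier = nxt
    return levels

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 0 - 1 - 2
print(bfs_levels(path, 0))
```

WCC and SSSP fit the same pattern with different element-wise operations, which is what lets one computation model run unchanged across PyTorch, TensorFlow, and their hardware backends.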
@inproceedings{SIGMOD-25-2,
  title = {TGraph: A Tensor-centric Graph Processing Framework},
  author = {Zhang, Yongliang and Zhu, Yuanyuan and Zhang, Hao and Gao, Congli and Wang, Yuyang and others},
  year = {2025},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3709731},
}
- [SIGMOD] Revisiting the Design of In-Memory Dynamic Graph Storage. Jixian Su, Chiyu Hao, Shixuan Sun, Hao Zhang, Sen Gao, et al. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2025.
The effectiveness of in-memory dynamic graph storage (DGS) for supporting concurrent graph read and write queries is crucial for real-time graph analytics and updates. Various methods have been proposed, for example, LLAMA, Aspen, LiveGraph, Teseo, and Sortledton. These approaches differ significantly in their support for read and write operations, space overhead, and concurrency control. However, there has been no systematic study exploring the trade-offs among these dimensions. In this paper, we evaluate the effectiveness of individual techniques and identify the performance factors affecting these storage methods by proposing a common abstraction for DGS design and implementing a generic test framework based on it. Our findings highlight several key insights: 1) Existing DGS methods exhibit substantial space overhead. For example, Aspen consumes 3.3-10.8x more memory than CSR, while the optimal fine-grained methods consume 4.1-8.9x more memory than CSR. 2) Existing methods often overlook the memory-access behavior of modern architectures, leading to performance degradation compared to contiguous storage methods. 3) Fine-grained concurrency-control methods, in particular, suffer from severe efficiency and space issues due to maintaining versions and performing checks for each neighbor; these methods also experience significant contention on high-degree vertices. Our systematic study reveals these performance bottlenecks and outlines future directions to improve DGS for real-time graph analytics.
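For reference, the CSR baseline against which those memory multipliers are measured stores every vertex's neighbors as a slice of one flat array — compact, but static. A minimal sketch (`build_csr` is an illustrative helper, not code from the paper):

```python
def build_csr(n, edges):
    """Compressed Sparse Row over n vertices and directed (u, v) edges.
    neighbors[offsets[v] : offsets[v + 1]] is v's adjacency list, so the
    whole graph is two flat arrays with no per-edge pointers or versions."""
    deg = [0] * n
    for u, _ in edges:
        deg[u] += 1
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + deg[v]
    neighbors = [0] * len(edges)
    cursor = list(offsets[:-1])  # per-vertex write positions
    for u, w in edges:
        neighbors[cursor[u]] = w
        cursor[u] += 1
    return offsets, neighbors

print(build_csr(3, [(0, 1), (0, 2), (1, 2)]))
```

Dynamic methods replace the flat arrays with trees, blocks, or versioned lists to support updates, which is where the 3-10x memory overheads the study measures come from.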
@inproceedings{SIGMOD-25-3,
  title = {Revisiting the Design of In-Memory Dynamic Graph Storage},
  author = {Su, Jixian and Hao, Chiyu and Sun, Shixuan and Zhang, Hao and Gao, Sen and others},
  year = {2025},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3709720},
}
- [VLDB] RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries. Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, et al. In 51st International Conference on Very Large Data Bases (VLDB), 2025.
Dynamic graph storage systems are essential for real-time applications such as social networks and recommendation systems, where the graph continuously evolves. However, they face significant challenges in efficiently handling concurrent read and write operations. We find that existing methods suffer from write queries interfering with read efficiency, substantial time and space overhead due to per-edge versioning, and an inability to balance performance across operations (e.g., slow searches). To address these issues, we propose RapidStore, a holistic approach to efficient in-memory dynamic graph storage designed for read-intensive workloads. Our key idea is to exploit the characteristics of graph queries through a decoupled system design that separates the management of read and write queries and decouples version data from graph data. Besides, we design an efficient dynamic graph store that cooperates with the graph concurrency-control mechanism. Experiments show that RapidStore enables fast and scalable concurrent graph queries, effectively balances the performance of inserts, searches, and scans, and significantly improves efficiency in dynamic graph storage systems.
@inproceedings{VLDB-25,
  title = {RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries},
  author = {Hao, Chiyu and Su, Jixian and Sun, Shixuan and Zhang, Hao and Gao, Sen and Zhao, Jianwen and Zhang, Chenyi and others},
  year = {2025},
  booktitle = {51st International Conference on Very Large Data Bases (VLDB)},
  doi = {10.14778/3748191.3748217},
}
2024
- [ICDE] Label Constrained Reachability Queries on Time Dependent Graphs. Yishu Wang, Jinlong Chu, Ye Yuan, Yu Gu, Hangxu Ji, and Hao Zhang. In 40th IEEE International Conference on Data Engineering (ICDE), 2024.
Label-constrained reachability (LCR) has been extensively studied. However, prior studies have neglected two aspects: label sequences and time-dependent properties. When processing reachability queries, not only label presence but also the label sequence and time-dependent properties should be considered. Various real-world scenarios, including vehicular networks, computing networks, and biological networks, require such queries. In this paper, we present a formal definition of time-dependent label-constrained reachability (TDLCR) queries based on LCR. These queries require both label-sequence and time-dependent constraints to be considered, introducing a higher level of complexity. To address this challenge, we propose two indexing algorithms optimized for the label constraint: OneL and TD2H. OneL builds a single-label index for each vertex and provides a baseline for solving the TDLCR problem. TD2H is based on the classical 2-hop index with excellent query efficiency, and introduces novel pruning rules and vertex-ordering strategies to reduce indexing overhead. To further balance indexing overhead and query efficiency and to optimize the time-dependent constraint, we introduce the BII algorithm, which improves index construction efficiency by building only a local index instead of a global one. Finally, experiments on many real datasets demonstrate that although BII's query time is slightly inferior to TD2H's, it has a significant advantage in index construction.
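For intuition, a plain label-constrained reachability query (without the paper's label-sequence or time-dependent constraints) is just a BFS that traverses only edges with allowed labels; index-based methods such as the 2-hop family precompute enough information to answer without this traversal at query time. A minimal sketch of the online baseline:

```python
from collections import deque

def lcr_reachable(adj, src, dst, allowed):
    """Basic LCR query: is dst reachable from src using only edges whose
    label is in `allowed`? adj maps a vertex to a list of (neighbor, label)."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v, label in adj.get(u, []):
            if label in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

graph = {0: [(1, "a")], 1: [(2, "b")]}
print(lcr_reachable(graph, 0, 2, {"a", "b"}))
```

TDLCR additionally constrains the order of labels along the path and the time windows in which edges are usable, which is what makes index construction the hard part.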
@inproceedings{ICDE-24-1,
  title = {Label Constrained Reachability Queries on Time Dependent Graphs},
  author = {Wang, Yishu and Chu, Jinlong and Yuan, Ye and Gu, Yu and Ji, Hangxu and Zhang, Hao},
  year = {2024},
  booktitle = {40th IEEE International Conference on Data Engineering (ICDE)},
  doi = {10.1109/ICDE60146.2024.00251},
}
- [ICDE] Attributed Network Embedding in Streaming Style. Anbiao Wu, Ye Yuan, Changsheng Li, Yuliang Ma, and Hao Zhang. In 40th IEEE International Conference on Data Engineering (ICDE), 2024.
Attributed network embedding (ANE) learns low-dimensional embeddings for nodes in attributed graphs, which facilitates several data analysis tasks. However, existing ANE methods fail to handle scenarios involving the continuous generation of attributes. The ongoing generation of attributes accumulates numerous attributes, incurring high storage costs in existing methods. Furthermore, due to storage limitations, old attributes are discarded as new ones are generated, and existing methods struggle to integrate the new attribute information into embeddings generated from old attributes. Therefore, we propose a novel ANE framework named SANE (Streaming-style ANE), featuring a 'memory' capability: when updating the embeddings for new attributes, old attribute information can be partly preserved. In SANE, we first define forward and backward affinity between nodes and attributes by viewing a node as a source or target node. This definition guides the quick computation of affinity vectors that integrate both topological and attribute information. Meanwhile, we propose an augmentation strategy that enriches node attribute information to enhance the quality of node embeddings. Leveraging the augmented attributes, we iteratively generate forward and backward affinity vectors, quantifying node-attribute affinity in two directions. Subsequently, we achieve streaming-style updates of node embeddings by employing matrix-sketching technology on these iteratively generated vectors. Furthermore, capitalizing on the mergeability of matrix sketching, we efficiently integrate information from newly generated attributes into node embeddings. Extensive experiments on 5 real datasets demonstrate that SANE surpasses state-of-the-art algorithms in node classification and link prediction. SANE's ability to quickly incorporate new attribute information into embeddings is validated through simulation experiments.
@inproceedings{ICDE-24-2,
  title = {Attributed Network Embedding in Streaming Style},
  author = {Wu, Anbiao and Yuan, Ye and Li, Changsheng and Ma, Yuliang and Zhang, Hao},
  year = {2024},
  booktitle = {40th IEEE International Conference on Data Engineering (ICDE)},
  doi = {10.1109/ICDE60146.2024.00243},
}
- [VLDB] TenGraph: A Tensor-Based Graph Query Engine. Guanghua Li, Hao Zhang, Xibo Sun, Qiong Luo, and Yuanyuan Zhu. In 50th International Conference on Very Large Data Bases (VLDB), 2024.
We propose a novel tensor-based approach to in-memory graph query processing. Tensors are multi-dimensional arrays, and have been utilized as data units in deep learning frameworks such as TensorFlow and PyTorch. Through tensors, these frameworks encapsulate optimized hardware-dependent code for automatic performance improvement on modern processors. Inspired by this practice, we explore how to utilize tensors to efficiently process graph queries. Specifically, we design a succinct storage format for tensors to represent graph topology effectively and compose graph query operations using tensor computation on batches of vertices. We have developed TenGraph, our PyTorch-based prototype, and evaluated it on graph query benchmark workloads in comparison with a variety of CPU- and GPU-based systems. Our experimental results show that TenGraph not only achieves a speedup of 50-100 times on the GPU over the CPU but also outperforms the other CPU- and GPU-based systems significantly.
@inproceedings{VLDB-24,
  title = {TenGraph: A Tensor-Based Graph Query Engine},
  author = {Li, Guanghua and Zhang, Hao and Sun, Xibo and Luo, Qiong and Zhu, Yuanyuan},
  year = {2024},
  booktitle = {50th International Conference on Very Large Data Bases (VLDB)},
  doi = {10.14778/3704965.3704967},
}
2023
- [VLDBJ] Learned sketch for subgraph counting: a holistic approach. Kangfei Zhao, Jeffrey Xu Yu, Qiyan Li, Hao Zhang, and Yu Rong. The VLDB Journal, 2023.
Subgraph counting, as a fundamental problem in network analysis, is to count the number of subgraphs in a data graph that match a given query graph by either homomorphism or subgraph isomorphism. The importance of subgraph counting derives from the fact that it provides insights into a large graph, in particular a labeled graph, when a collection of query graphs with different sizes and labels are issued. The problem of counting is challenging. On the one hand, exact counting by enumerating subgraphs is NP-hard. On the other hand, approximate counting by subgraph isomorphism can only support small query graphs over unlabeled graphs. Another way to do subgraph counting is to specify it as an SQL query and estimate the cardinality of the query in an RDBMS. Existing approaches for cardinality estimation can only support subgraph counting by homomorphism up to some extent, as it is difficult to deal with sampling failure when a query graph becomes large. A question that arises is how to support subgraph counting by machine learning (ML) and deep learning (DL). To devise an ML/DL solution, apart from the query graphs, another issue is dealing with large data graphs, as the existing DL approach for subgraph isomorphism counting can only support small data graphs. In addition, the ML/DL approaches proposed in the RDBMS context for approximate query processing and cardinality estimation cannot be used, as subgraph counting performs complex self-joins over one relation, whereas existing approaches focus on multiple relations. In this work, we propose an active learned sketch for subgraph counting (ALSS) with two main components: a learned sketch for subgraph counting and an active learner. The sketch is constructed by a neural network regression model, and the active learner performs model updates based on newly arriving test query graphs. Our holistic learning framework supports both undirected and directed graphs, whose nodes and/or edges are associated with zero or more labels. We conduct extensive experimental studies to confirm the effectiveness and efficiency of ALSS using large real labeled graphs. Moreover, we show that ALSS can assist query optimizers in finding better query plans for complex multi-way self-joins.
@article{VLDBJ-23,
  title = {Learned sketch for subgraph counting: a holistic approach},
  author = {Zhao, Kangfei and Yu, Jeffrey Xu and Li, Qiyan and Zhang, Hao and Rong, Yu},
  year = {2023},
  journal = {The VLDB Journal},
  doi = {10.1007/s00778-023-00781-5},
}
2022
- [ICDE] How Learning Can Help Complex Cyclic Join Decomposition. Hao Zhang^, Qiyan Li, Kangfei Zhao, Jeffrey Xu Yu, and Yuanyuan Zhu. In 38th IEEE International Conference on Data Engineering (ICDE) (Demo), 2022.
Recently, machine learning (ML) and deep learning (DL) techniques have been extensively studied in database systems, including cardinality/selectivity estimation for optimizing queries with selections and joins. However, how to support complex cyclic join queries with ML/DL has not yet been well studied. An important research issue in optimizing complex cyclic join queries is how to decompose them into a join tree, where a node in the join tree may represent a subquery with cyclic joins. The main application of complex cyclic join queries is subgraph matching, which finds matches of a user-given pattern graph in a large node/edge-labeled graph by subgraph isomorphism when the graph is stored in a relational database system. Here, when a graph is stored in an edge table, the joins are mainly self-joins. In existing work, such decomposition is done by estimation with the AGM bound. In this work, we demonstrate how ML/DL can support such complex cyclic self-joins by providing more accurate estimation. We build a prototype system, LSSMatch, based on ML/DL techniques, with a GUI that provides insights into how ML/DL-based techniques contribute to query optimization for complex cyclic self-join queries.
@inproceedings{ICDE-22,
  title = {How Learning Can Help Complex Cyclic Join Decomposition},
  author = {Zhang, Hao and Li, Qiyan and Zhao, Kangfei and Yu, Jeffrey Xu and Zhu, Yuanyuan},
  year = {2022},
  booktitle = {38th IEEE International Conference on Data Engineering (ICDE) (Demo)},
  doi = {10.1109/ICDE53745.2022.00282},
}
- [SIGMOD] Parallel Query Processing: To Separate Communication from Computation. Hao Zhang^, Jeffrey Xu Yu, Yikai Zhang, and Kangfei Zhao. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2022.
In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. The communication cost becomes large if the intermediate results between operators are large in intra-operator parallelism. Existing approaches optimize an SQL query by arranging relational algebra operators to reduce the total cost, where each operator involves (i) distribution of partitioned data to computing nodes by communication, and (ii) computation on computing nodes locally. The communication and computation are handled inside an operator and are not separable; in other words, it is difficult to avoid large intermediate results and hence reduce the communication cost. To reduce communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator (⊗) to pair the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator, we can explicitly deal with communication to deliver pairs of partitions to computing nodes. Together with ⊗, we can also explicitly treat the local computation on a computing node as op' for any RA (relational algebra) operator op. We give a merge operator (∪') to collect all partial results from computing nodes as they are. In short, with ⊗, op', and ∪', we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely partitioning push-down and computation push-up, to separate communication from computation for RA expressions. We prove that we can push down/up for a wide range of relational expressions. We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in performance studies using real datasets.
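The separation can be sketched in a few lines (illustrative Python, not Secco's implementation): hash-partition both relations on the join key, let the pair operator ⊗ match co-partitions for shipment, run the local join op' on each pair, and merge partial results with ∪'.

```python
def partition(rel, key_idx, p):
    """Hash-partition a relation's tuples into p buckets on one attribute."""
    parts = [[] for _ in range(p)]
    for t in rel:
        parts[hash(t[key_idx]) % p].append(t)
    return parts

def pair_join(R, S, p):
    """Join R(a, b) with S(b, c) on b with communication made explicit:
    the pair operator ships (R_i, S_i) pairs; op' is a local nested-loop
    join on each computing node; the merge operator unions partial results."""
    Rp, Sp = partition(R, 1, p), partition(S, 0, p)
    out = []
    for Ri, Si in zip(Rp, Sp):          # communication: deliver partition pairs
        for r in Ri:                    # computation: local join op'
            for s in Si:
                if r[1] == s[0]:
                    out.append((r[0], r[1], s[1]))
    return out                          # merge: ∪' over partial results

print(pair_join([(1, 10), (2, 20)], [(10, "x"), (20, "y")], 2))
```

Because the same hash function partitions both sides, matching tuples always land in the same bucket index, so each pair can be joined independently on its own node.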
@inproceedings{SIGMOD-22-1,
  title = {Parallel Query Processing: To Separate Communication from Computation},
  author = {Zhang, Hao and Yu, Jeffrey Xu and Zhang, Yikai and Zhao, Kangfei},
  year = {2022},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3514221.3526164},
}
- [SIGMOD] Lightweight and Accurate Cardinality Estimation by Neural Network Gaussian Process. Kangfei Zhao, Jeffrey Xu Yu, Zongyan He, Rui Li, and Hao Zhang. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2022.
Deep Learning (DL) has achieved great success in many real applications. Despite this success, several problems arise when deploying advanced DL models in database systems, such as hyper-parameter tuning, the risk of overfitting, and the lack of prediction uncertainty. In this paper, we study a lightweight, accurate, and uncertainty-aware cardinality estimator for SQL queries. By lightweight, we mean that we can train a DL model in a few seconds. With uncertainty ensured, it becomes possible to update the estimator to improve its prediction in areas with high uncertainty. The approach we explore differs from the direction of deploying sophisticated DL models as cardinality estimators in database systems. We employ Bayesian deep learning (BDL), which serves as a bridge between Bayesian inference and deep learning. The prediction distribution of BDL provides principled uncertainty calibration for the prediction. In addition, when the network width of a BDL model goes to infinity, the model performs equivalently to a Gaussian Process (GP). This special class of BDL, known as Neural Network Gaussian Process (NNGP), inherits the advantages of the Bayesian approach while keeping the universal approximation ability of neural networks, and can utilize a much larger model space to model distribution-free data as a nonparametric model. In extensive performance studies comparing with existing learned estimators, we show that our NNGP estimator achieves high accuracy, is fast to build, and is robust to query workload shift. We also confirm the effectiveness of NNGP by integrating it into PostgreSQL.
@inproceedings{SIGMOD-22-2,
  title = {Lightweight and Accurate Cardinality Estimation by Neural Network Gaussian Process},
  author = {Zhao, Kangfei and Yu, Jeffrey Xu and He, Zongyan and Li, Rui and Zhang, Hao},
  year = {2022},
  booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)},
  doi = {10.1145/3514221.3526156},
}
2021
- [ICDE] Fast Distributed Complex Join Processing. Hao Zhang^, Miao Qiao, Jeffrey Xu Yu, and Hong Cheng. In 37th IEEE International Conference on Data Engineering (ICDE) (Short), 2021.
Big data analytics often requires processing complex join queries in parallel in distributed systems such as Hadoop, Spark, and Flink. Previous works consider the main bottleneck of processing complex join queries to be the communication cost incurred by shuffling intermediate results, and propose to cut this shuffling cost to zero with a one-round multi-way join algorithm. The one-round multi-way join algorithm is built on a one-round communication-optimal algorithm for data shuffling over servers and a worst-case-optimal computation algorithm for sequential join evaluation on each server. These works focus on optimizing the communication bottleneck while neglecting the fact that the query can be computationally intensive. With the communication cost well optimized, the computation cost may become a bottleneck. One way to reduce the computation bottleneck is to trade computation for communication by pre-computing some partial results, but this can make communication or pre-computing become the bottleneck instead. With only one of the three costs considered at a time, the combined lowest cost may not be achieved. Thus, the question left unanswered is how much should be traded such that the combined cost of computation, communication, and pre-computing is minimal. In this work, we study the problem of co-optimizing communication, pre-computing, and computation cost in one-round multi-way join evaluation. We propose a multi-way join approach, ADJ (Adaptive Distributed Join), which finds an optimal query plan by exploring cost-effective partial results in terms of the trade-off between pre-computing, communication, and computation. We analyze the input relations for a given join query and find the optimal plan over a set of query plans of a specific form, with high-quality cost estimation by sampling. Our extensive experiments confirm that ADJ outperforms existing multi-way join methods by up to orders of magnitude.
@inproceedings{ICDE-21,
  title = {Fast Distributed Complex Join Processing},
  author = {Zhang, Hao and Qiao, Miao and Yu, Jeffrey Xu and Cheng, Hong},
  year = {2021},
  booktitle = {37th IEEE International Conference on Data Engineering (ICDE) (Short)},
}
- [SIGMOD] A Learned Sketch for Subgraph Counting. Kangfei Zhao, Jeffrey Xu Yu, Hao Zhang, Qiyan Li, and Yu Rong. In ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), 2021.
Subgraph counting, a fundamental problem in network analysis, is to count the number of subgraphs in a data graph that match a given query graph by either homomorphism or subgraph isomorphism. The importance of subgraph counting derives from the fact that it provides insights into a large graph, in particular a labeled graph, when a collection of query graphs with different sizes and labels are issued. The counting problem is challenging. On one hand, exact counting by enumerating subgraphs is NP-hard. On the other hand, approximate counting by subgraph isomorphism can only support 3-/5-node query graphs over unlabeled graphs. Another way to do subgraph counting is to specify it as an SQL query and estimate the cardinality of the query in an RDBMS. Existing approaches for cardinality estimation can only support subgraph counting by homomorphism to some extent, as it is difficult to deal with sampling failure when a query graph becomes large. A question that arises is whether subgraph counting can be supported by machine learning (ML) and deep learning (DL). The existing DL approach for subgraph isomorphism can only support small data graphs, and the ML/DL approaches proposed in the RDBMS context for approximate query processing and cardinality estimation cannot be used, as subgraph counting requires complex self-joins over one relation, whereas existing approaches focus on multiple relations. In this paper, we propose an Active Learned Sketch for Subgraph Counting (ALSS) with two main components: a learned sketch (LSS) and an active learner (AL). The sketch is learned by a neural network regression model, and the active learner performs model updates based on newly arrived query graphs. We conduct extensive experimental studies to confirm the effectiveness and efficiency of ALSS using large real labeled graphs. Moreover, we show that ALSS can assist the query optimizer in finding better query plans for complex multi-way self-joins.
@inproceedings{SIGMOD-21, title = {A Learned Sketch for Subgraph Counting}, author = {Zhao, Kangfei and Yu, Jeffrey Xu and Zhang, Hao and Li, Qiyan and Rong, Yu}, year = {2021}, booktitle = {ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD)}, doi = {10.1145/3448016.3457289} }
2020
- VLDBDistributed Subgraph Counting: A General ApproachHao Zhang^ , Jeffrey Xu Yu , Yikai Zhang , Kangfei Zhao , and Hong ChengIn 46th International Conference on Very Large Data Bases (VLDB), 2020
In this paper, we study local subgraph counting, which is to count the occurrences of a user-given pattern graph p around every node v in a data graph G, where v matches a given orbit o in p and the orbit serves as the center at which p is counted. In general, the orbit can be a node, an edge, or a set of nodes in p. Local subgraph counting plays an important role in characterizing the high-order local structures that appear in a large graph, and has been widely used in mining dense and relevant communities, graphlet degree distribution, discriminative feature selection for link prediction, relational classification, and recommendation. In the literature, almost all existing works support a k-node pattern graph, for k <= 5, with either a 1-node orbit or a 1-edge orbit. Their approaches are difficult to extend to larger k, because they count by subgraph isomorphism. In this work, we develop a new general approach to count any k-node pattern graph with any selected orbit. The key idea is that we do local subgraph counting by homomorphism counting, which can be solved in relational algebra using joins, group-by, and aggregation. With homomorphism counting, we obtain the local subgraph count by eliminating, from the total count of all possible matchings, the counts of those that are not subgraph isomorphism matchings. We have developed a distributed system named DISC on Spark. Our extensive experiments validate the efficiency of our approach on 114 local subgraph counting queries used in existing work over real graphs, where no single existing work can support them all.
@inproceedings{VLDB-20, title = {Distributed Subgraph Counting: A General Approach}, author = {Zhang, Hao and Yu, Jeffrey Xu and Zhang, Yikai and Zhao, Kangfei and Cheng, Hong}, year = {2020}, booktitle = {46th International Conference on Very Large Data Bases (VLDB)}, doi = {10.14778/3407790.3407840}, }
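The core trick above, counting by homomorphism and then correcting away the matchings that are not subgraph isomorphisms, can be illustrated on the simplest non-trivial pattern, a wedge (2-path). The sketch below is not DISC's relational-algebra pipeline; it is a minimal self-contained illustration, under simplified assumptions, of the homomorphism-to-isomorphism correction:

```python
from collections import defaultdict

def wedge_counts(edges):
    """Count wedge (2-path) matchings in a simple undirected graph.

    Homomorphism count: all maps (a, b, c) with edges ab and bc, where a
    and c may coincide. Subtracting the non-injective maps (a == c) and
    halving for the (a, c) symmetry yields the subgraph isomorphism count.
    """
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    hom = sum(d * d for d in deg.values())   # sum over centers b of deg(b)^2
    non_injective = sum(deg.values())        # maps with a == c: one per adjacency, i.e. 2m
    wedges = (hom - non_injective) // 2      # eliminate double counting of (a, c) vs (c, a)
    return hom, wedges

# Triangle 0-1-2 plus a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(wedge_counts(edges))  # (18, 5): 18 homomorphisms, 5 wedges
```

The homomorphism count is a pure group-by/aggregate over the edge relation joined with itself, which is exactly what makes the homomorphism-based formulation amenable to relational engines.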
2019
- TKDESQL-G: Efficient Graph Analytics by SQLKangfei Zhao , Jiao Su , Jeffrey Xu Yu , and Hao ZhangIEEE Transactions on Knowledge and Data Engineering (TKDE), 2019
Querying graphs and conducting graph analytics have become important in data processing, since many real applications deal with massive graphs, such as online social networks, the Semantic Web, and knowledge graphs. Over the years, many distributed graph processing systems have been developed to support graph analytics using various programming models, and many graph querying languages have been proposed. A natural question that arises is how to integrate graph data and traditional non-graph data in a distributed system for users to conduct analytics. There are two issues. One is expressiveness: how to specify graph analytics as well as data analytics in a querying language. The other is efficiency: how to process such analytics in a distributed system. For the first issue, SQL is a strong candidate, since it is a well-accepted language for data processing, and we concentrate on SQL for graph analytics. Our earlier work shows that graph analytics can be supported by SQL in a way from “semiring + while” to “relational algebra + while” via enhanced recursive SQL queries. In this article, we focus on the second issue: how to process such enhanced recursive SQL queries based on the GAS (Gather-Apply-Scatter) model, under which efficient graph processing systems can be developed. To demonstrate the efficiency, we implemented a system by tightly coupling Spark SQL and GraphX on Spark, one of the most popular in-memory data-flow processing platforms. First, we enhance Spark SQL with the capability of supporting the enhanced recursive SQL queries for graph analytics, so that graph analytics can be processed using a distributed SQL engine alone. Second, we further propose new transformation rules to optimize/translate the operations of recursive SQL queries into operations in GraphX. In this regard, graph analytics by SQL can be processed in a similar way as by a distributed graph processing system using the APIs it provides. We conduct extensive performance studies to test graph analytics using large real graphs, and show that our approach achieves similar or even higher efficiency compared to the built-in graph algorithms of existing graph processing systems.
@article{TKDE-19, title = {SQL-G: Efficient Graph Analytics by SQL}, author = {Zhao, Kangfei and Su, Jiao and Yu, Jeffrey Xu and Zhang, Hao}, year = {2019}, journal = {IEEE Transactions on Knowledge and Data Engineering (TKDE)}, doi = {10.1109/TKDE.2019.2950620} }
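The enhanced recursive SQL queries of SQL-G go beyond standard SQL, but the flavor of expressing a graph traversal as "relational algebra + while" can be sketched with a plain recursive common table expression. The example below uses SQLite purely for illustration (an assumption; the paper targets Spark SQL and GraphX) to compute the set of nodes reachable from node 1:

```python
import sqlite3

# Edge table for a tiny directed graph; 5 -> 6 is unreachable from node 1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [(1, 2), (2, 3), (3, 4), (5, 6)])

# A recursive CTE: seed with the start node, then repeatedly join the
# frontier with the edge table; UNION's set semantics end the recursion.
reachable = con.execute("""
    WITH RECURSIVE reach(v) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.v
    )
    SELECT v FROM reach ORDER BY v
""").fetchall()
print([v for (v,) in reachable])  # [1, 2, 3, 4]
```

Each recursive step is a join plus a duplicate-eliminating union, which is the relational counterpart of one superstep of frontier expansion in a Gather-Apply-Scatter system.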
2018
- VLDBSubgraph matching: on compression and computationMiao Qiao , Hao Zhang , and Hong ChengIn 44th International Conference on Very Large Data Bases (VLDB), 2018
Subgraph matching finds the set I of all occurrences of a pattern graph in a target graph. It has a wide range of applications but suffers from expensive computation, an efficiency issue that has been studied extensively. All existing approaches, however, turn a blind eye to the output crisis: when the system has to materialize I as a preprocessing step, intermediate result, final result, or index, the cost of exporting I dominates the overall cost, and can be prohibitive even for a small pattern graph. This paper studies subgraph matching via two problems. 1) Is there an ideal compression of I? 2) Can the compression of I, in turn, boost the computation of I? For problem 1, we propose a technique called VCBC to compress I into code(I), which effectively serves the same purpose as I. For problem 2, we propose a subgraph matching computation framework, CBF, which computes code(I) instead of I to bring down the output cost; CBF further reduces the overall cost by shrinking the intermediate results. Extensive experiments show that the compression ratio of VCBC can be up to 10^5, which also significantly lowers the output cost of CBF, and confirm the superior performance of CBF over existing approaches.
@inproceedings{VLDB-18, title = {Subgraph matching: on compression and computation}, author = {Qiao, Miao and Zhang, Hao and Cheng, Hong}, year = {2018}, booktitle = {44th International Conference on Very Large Data Bases (VLDB)}, doi = {10.14778/3149193.3149198}, }
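The output crisis, and the benefit of computing a compressed code(I) instead of I, can be seen on a toy case. The sketch below is not the paper's VCBC encoding; it merely illustrates, under simplified assumptions, how a factorized representation of star-pattern matchings can be far smaller than the materialized result set:

```python
from itertools import permutations

def materialize(adj):
    """All ordered matchings (center, leaf1, leaf2) of a 2-leaf star pattern,
    with the two leaves distinct neighbors of the center."""
    return [(c, l1, l2)
            for c, nbrs in adj.items()
            for l1, l2 in permutations(nbrs, 2)]

def compress(adj):
    """Factorized form: one (center, neighbor-list) entry per matching center,
    from which the explicit matchings can be expanded on demand."""
    return [(c, sorted(nbrs)) for c, nbrs in adj.items() if len(nbrs) >= 2]

adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}  # a 4-leaf star
full = materialize(adj)
code = compress(adj)
print(len(full), len(code))  # 12 ordered matchings vs 1 compressed entry
```

The gap grows quadratically with the center's degree here; for larger patterns on dense graphs, factorized encodings are what make compression ratios of several orders of magnitude possible.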
2017
- DASFAAEfficient Local Clustering Coefficient Estimation in Massive GraphsHao Zhang^ , Yuanyuan Zhu , Lu Qin , Hong Cheng , and Jeffrey Xu YuIn 22nd Database Systems for Advanced Applications (DASFAA), 2017
Graphs are a powerful tool to model interactions in disparate applications, and assessing the structure of a graph is an essential task across all these domains. As a classic measure characterizing the connectivity of graphs, the clustering coefficient and its variants are of particular interest in graph structural analysis. However, the largest of today’s graphs may have billions of nodes and edges, which makes even the simple task of computing clustering coefficients complicated and expensive. Thus, approximate solutions have recently attracted much attention from researchers. However, they only target global and binned degree-wise clustering coefficient estimation, and their techniques are not suitable for estimating the local clustering coefficient, which is of great importance for individual nodes. In this paper, we propose a new sampling scheme to estimate the local clustering coefficient with bounded error, of which the global and binned degree-wise clustering coefficients can be considered special cases. Based on our sampling scheme, we propose a new framework to estimate all three clustering coefficients in a unified way. To make it scalable to massive graphs, we further design an efficient MapReduce algorithm under this framework. Extensive experiments validate the efficiency and effectiveness of our algorithms, which significantly outperform state-of-the-art exact and approximate algorithms on many real graph datasets.
@inproceedings{DASFAA-17, title = {Efficient Local Clustering Coefficient Estimation in Massive Graphs}, author = {Zhang, Hao and Zhu, Yuanyuan and Qin, Lu and Cheng, Hong and Yu, Jeffrey Xu}, year = {2017}, booktitle = {22nd Database Systems for Advanced Applications (DASFAA)}, } - DPDEfficient MapReduce algorithms for triangle listing in billion-scale graphsYuanyuan Zhu , Hao Zhang , Lu Qin , and Hong ChengDistributed and Parallel Databases (DPD), 2017
This paper addresses the classical triangle listing problem, which aims at enumerating all triples of three vertices pairwise connected by edges. The problem has been intensively studied in internal and external memory, but it remains an urgent challenge in distributed environments, where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies for distributed environments, MapReduce has been used in some existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large-scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a lightweight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and which is equipped with multi-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be bounded close to the number of triangles in the graph. To further reduce the shuffle size and memory cost, we also propose improved algorithms based on a compact data structure, and present several optimization techniques to accelerate the computation and reduce memory consumption. Extensive experimental results show that our algorithms outperform existing competitors by several times on both synthetic and real-world graphs.
@article{Distributed-And-Parallel-Database-17, title = {Efficient MapReduce algorithms for triangle listing in billion-scale graphs}, author = {Zhu, Yuanyuan and Zhang, Hao and Qin, Lu and Cheng, Hong}, year = {2017}, journal = {Distributed and Parallel Databases (DPD)}, }
2016
- BigDataEfficient triangle listing for billion-scale graphsHao Zhang^ , Yuanyuan Zhu , Lu Qin , Hong Cheng , and Jeffrey Xu YuIn IEEE International Conference on Big Data (BigData), 2016
This paper addresses the classical triangle listing problem, which aims at enumerating all triples of three vertices pairwise connected by edges. The problem has been intensively studied in internal and external memory, but it remains an urgent challenge in distributed environments, where multiple machines across the network can be utilized to achieve good performance and scalability. As one of the de facto computing methodologies for distributed environments, MapReduce has been used in some existing triangle listing algorithms. However, these algorithms usually need to shuffle a huge amount of intermediate data, which seriously hinders their scalability on large-scale graphs. In this paper, we propose a new triangle listing algorithm in MapReduce, FTL, which utilizes a lightweight data structure to substantially reduce the intermediate data transferred during the shuffle stage, and which is equipped with multi-round techniques to ease the burden on memory and network bandwidth when dealing with graphs at billion scale. We prove that the size of the intermediate data can be bounded close to the number of triangles in the graph. To further reduce the shuffle size in each round, we also devise a compact data structure to store the intermediate data, saving up to 2/3 of the space. Extensive experimental results show that our algorithms outperform existing competitors by several times on large real-world graphs.
@inproceedings{BigData-16, title = {Efficient triangle listing for billion-scale graphs}, author = {Zhang, Hao and Zhu, Yuanyuan and Qin, Lu and Cheng, Hong and Yu, Jeffrey Xu}, year = {2016}, booktitle = {IEEE International Conference on Big Data (BigData)}, }
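The degree-based edge orientation that lets each triangle be listed exactly once, and that underlies shuffle-light schemes like FTL, can be sketched in a few lines. This is a minimal single-machine illustration of the orientation idea, not the MapReduce algorithm itself:

```python
from collections import defaultdict

def list_triangles(edges):
    """List each triangle exactly once by orienting every edge from its
    lower-ranked to its higher-ranked endpoint under a degree-based total
    order, then intersecting out-neighborhoods."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)                  # break degree ties by node id
    out = defaultdict(set)                        # edges oriented low -> high rank
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    triangles = []
    for u in out:
        for v in out[u]:
            for w in out[u] & out.get(v, set()):  # common out-neighbor closes a triangle
                triangles.append((u, v, w))
    return triangles

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(list_triangles(edges))  # [(0, 1, 2)]: the single triangle on {0, 1, 2}
```

Because every edge is stored in one direction only and high-degree nodes receive few out-edges, the intermediate data generated per node stays small, which is the same intuition behind bounding shuffle size near the triangle count.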