Hao Zhang

Research Scientist at ByteDance

AI-native Data Systems

zhanghao.ai@bytedance.com

I am a Research Scientist at ByteDance, building data systems for AI-native and data-intensive workloads. My current focus is agentic data management: the systems layer beneath LLM agents that turns scattered enterprise and multimodal data into structured, retrieval-optimized knowledge substrates. This includes agentic data integration and knowledge extraction, as well as graph/vector retrieval infrastructure. Beyond this focus, I also work on accelerator-aware query execution and distributed query processing.

I received my Ph.D. from the Chinese University of Hong Kong in 2022, advised by Prof. Jeffrey Xu Yu and Prof. Hong Cheng, and my B.S. in Computer Science from the Hongyi Honor School at Wuhan University in 2017.

Highlights

  • 25+ publications across database systems, graph/vector retrieval, and Data+AI infrastructure, including papers in SIGMOD, VLDB, ICDE, TKDE, and The VLDB Journal.
  • Designed and architected systems including Sema, GES, SeccoSQL, DISC, and Crystal, ranging from research prototypes to production infrastructure.
  • LDBC SNB Interactive world-record results in both the declarative track (2024, 3,000× over the #2 result) and the imperative track (2025).

Current Focus: Agentic Data Management

My research asks how data systems should manage heterogeneous information for LLM agents: integrating scattered sources, extracting reliable knowledge, and maintaining it as queryable substrates. A central question is retrieval-aware knowledge representation: what form should extracted knowledge take so that downstream retrieval is accurate, efficient, attributable, and fresh? The agenda is data-centric rather than LLM-centric: agents should operate over curated knowledge substrates whose structure is explicit, queryable, and system-maintained.

This focus spans two layers: agentic data integration and knowledge extraction (what knowledge to extract and how to maintain it), and retrieval and storage substrates (the infrastructure that physically stores and serves that knowledge).

Agentic data integration & knowledge extraction

Agents depend on integrated knowledge views over heterogeneous sources — documents, tables, vector indexes, knowledge bases, memory, and multimodal content. Rather than treating documents as opaque context or chunking by layout heuristics, the system derives knowledge units, relations, provenance, temporal scopes, and materialized views — driven by downstream retrieval objectives and agent workloads. Unlike classical virtual integration that defers everything to query time, agentic integration must extract, align, and maintain knowledge between queries: schema and semantic alignment, entity/relation extraction, semantic joins, versioning, deduplication, conflict detection, and freshness. The goal is to turn fragmented corpora into inspectable substrates that agents can query, join, and trust.
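As a rough illustration of the maintenance steps named above (deduplication, provenance tracking, temporal scoping, and conflict detection), here is a minimal Python sketch. It is not taken from any system listed on this page: the `KnowledgeUnit` type, the `integrate` function, and the sample facts are all hypothetical, and the rules it applies (merge exact duplicates, flag different values whose validity periods overlap) are simplifying assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class KnowledgeUnit:
    """Hypothetical extracted fact with provenance and a temporal scope."""
    subject: str
    relation: str
    value: str
    sources: Set[str]                # provenance: where the fact was extracted from
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def scopes_overlap(a: KnowledgeUnit, b: KnowledgeUnit) -> bool:
    """True if the two validity periods share at least one day."""
    a_end = a.valid_to or date.max
    b_end = b.valid_to or date.max
    return a.valid_from <= b_end and b.valid_from <= a_end


def integrate(
    units: List[KnowledgeUnit],
) -> Tuple[List[KnowledgeUnit], List[Tuple[KnowledgeUnit, KnowledgeUnit]]]:
    """Merge exact duplicates (unioning their sources) and flag conflicts."""
    merged: Dict[tuple, KnowledgeUnit] = {}
    for u in units:
        key = (u.subject, u.relation, u.value, u.valid_from, u.valid_to)
        if key in merged:
            merged[key].sources |= u.sources  # same fact, additional provenance
        else:
            merged[key] = KnowledgeUnit(
                u.subject, u.relation, u.value, set(u.sources),
                u.valid_from, u.valid_to,
            )
    kept = list(merged.values())
    # Conflict: same subject and relation, different value, overlapping scope.
    conflicts = [
        (a, b)
        for i, a in enumerate(kept)
        for b in kept[i + 1:]
        if (a.subject, a.relation) == (b.subject, b.relation)
        and a.value != b.value
        and scopes_overlap(a, b)
    ]
    return kept, conflicts


# Made-up sample data, for illustration only.
units = [
    KnowledgeUnit("acme", "ceo", "Alice", {"wiki.md"}, date(2020, 1, 1), date(2023, 12, 31)),
    KnowledgeUnit("acme", "ceo", "Alice", {"hr.csv"}, date(2020, 1, 1), date(2023, 12, 31)),
    KnowledgeUnit("acme", "ceo", "Bob", {"news.html"}, date(2024, 1, 1)),
    KnowledgeUnit("acme", "ceo", "Carol", {"blog.txt"}, date(2023, 6, 1)),
]
kept, conflicts = integrate(units)
```

In this sample the two "Alice" records collapse into one unit that carries both sources, and "Carol" is flagged against both "Alice" and "Bob" because her validity period overlaps theirs. A real system would presumably need fuzzy entity matching and a policy for resolving conflicts, which this sketch leaves out.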

Representative work:

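Recent writing: "Context Management 的下一代:维护模型可感知的世界" ("The Next Generation of Context Management: Maintaining a World the Model Can Perceive"; English version available).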

Retrieval and storage substrates

The knowledge substrates above sit on storage, indexing, and retrieval infrastructure. My work here targets dynamic graph stores, vector indexes, and hybrid query engines, where correctness under concurrent updates and retrieval throughput are the primary design constraints.

Representative work:

Broader Systems Work

Accelerator-aware query execution

I also work on accelerator-aware query execution: using tensor runtimes (PyTorch, TensorFlow) to execute SQL and graph operators on GPUs and heterogeneous hardware. The core problems are tensorizing irregular relational and graph operators, managing memory across XPU backends, and making query execution portable across accelerator stacks.

Representative work:

Earlier foundations

My earlier work focused on distributed query processing, subgraph analytics, and learned query optimization. These techniques — efficient joins, cardinality estimation, communication–computation separation, and distributed graph execution — form a systems foundation that I continue to build on today.

Representative work:

News

  • 11/2025: LDBC SNB Interactive imperative track world record (#1).
  • 10/2025: "Accelerating Triangle-Connected Truss Community Search Across Heterogeneous Hardware" accepted by SIGMOD’26.
  • 09/2025: "Aquila: a high-concurrency system for incremental graph query" accepted by VLDB’26.

Collaboration

I am open to collaboration on agentic data management, retrieval-aware knowledge representation, agentic data integration and knowledge extraction, and vector/graph engines. Internship positions are open in Shenzhen; please email me with [Intern] in the subject line.