Overview
GLB 2023 is the third edition of the Workshop on Graph Learning Benchmarks, encouraged by the success of the previous editions. Inspired by the conference tracks in the computer vision and natural language processing communities dedicated to establishing new benchmark datasets and tasks, we call for contributions that establish novel ML tasks on novel graph-structured data and have the potential to (i) increase the diversity of graph learning benchmarks, (ii) identify new demands of graph machine learning in general, and (iii) improve our understanding of how concrete techniques perform on these benchmarks. We also welcome contributions on data-centric graph learning, such as novel approaches to collecting, annotating, cleaning, augmenting, and synthesizing graph-structured data.
GLB 2023 will be a non-archival workshop; we are excited to host this edition in person in conjunction with KDD 2023. Please click here for KDD 2023 registration.
Our previous call for papers can be found here.
Schedule
We have a full-day program from 8am to 5pm on Sunday (Aug. 6) at Grand Ballroom B.
| Time (PDT) | Agenda |
| --- | --- |
| 8:00-8:10am | Opening remarks |
| 8:10-8:50am | Keynote by Yizhou Sun (40 min): Graph Neural Networks: Trends and Open Problems |
| 8:50-9:30am | Keynote by Da Zheng (40 min): Graph machine learning for industry applications with DGL and GraphStorm |
| 9:30-10:00am | Coffee Break (30 min) |
| 10:00-10:40am | Keynote by Xavier Bresson (40 min): A Generalization of Visual Transformers and MLP-Mixer to Graphs |
| 10:40-11:30am | Contributed Talks - Session 1 (50 min): A critical look at the evaluation of GNNs under heterophily: Are we really making progress? (Outstanding Paper); Examining the Effects of Degree Distribution and Homophily in Graph Learning Models; Impact-Oriented Contextual Scholar Profiling using Self-Citation Graphs; An Out-of-the-Box Application for Reproducible Graph Collaborative Filtering extending the Elliot Framework |
| 11:30am-1:00pm | Lunch Break (90 min) |
| 1:00-1:40pm | Keynote by Atlas Wang (40 min): Unveiling the Simplicity in Training Graph Neural Networks |
| 1:40-2:20pm | Keynote by Jimeng Sun (40 min): Data, Benchmark and Models to Enable AI in Healthcare |
| 2:20-3:00pm | Contributed Talks - Session 2 (40 min): Graph Generative Model for Benchmarking Graph Neural Networks (Outstanding Paper); TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs; NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics |
| 3:00-3:30pm | Coffee Break (30 min) |
| 3:30-4:30pm | Panel Discussion (60 min). Moderator: Jingrui He. Panelists: Michael Galkin, Neil Shah, Yujun Yan |
| 4:30-4:50pm | Contributed Talks - Session 3 (20 min): A Metadata-Driven Approach to Understand Graph Neural Networks; Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit |
| 4:50-5:00pm | Closing Remarks |
Click here to view the detailed schedule in Google Sheets.
Keynote Speakers
- Yizhou Sun (University of California, Los Angeles): A Graph Benchmark Dataset for Hardware Design Automation
- Da Zheng (Amazon): Graph machine learning for industry applications with DGL and GraphStorm
- Xavier Bresson (National University of Singapore): A Generalization of Visual Transformers and MLP-Mixer to Graphs
- Atlas Wang (The University of Texas at Austin): Unveiling the Simplicity in Training Graph Neural Networks
- Jimeng Sun (University of Illinois, Urbana-Champaign): Data, Benchmark and Models to Enable AI in Healthcare
Panelists
- Michael Galkin (Intel Labs)
- Neil Shah (Snap Research)
- Yujun Yan (Dartmouth College)
Accepted Papers
- Impact-Oriented Contextual Scholar Profiling using Self-Citation Graphs
  Yuankai Luo (Beihang University); Fengli Xiao (Beihang University); Lei Shi (Beihang University)
  Abstract: Quantitatively profiling a scholar's scientific impact is important to modern research society. Current practices with bibliometric indicators (e.g., h-index), lists, and networks perform well at scholar ranking, but do not provide structured context for scholar-centric, analytical tasks such as profile reasoning and understanding. This work presents GeneticFlow (GF), a suite of novel graph-based scholar profiles that fulfill three essential requirements: structured-context, scholar-centric, and evolution-rich. We propose a framework to compute GF over large-scale academic data sources with millions of scholars. The framework encompasses a new unsupervised advisor-advisee detection algorithm, a well-engineered citation-type classifier using interpretable features, and a fine-tuned graph neural network (GNN) model. Evaluations are conducted on the real-world task of scientific award inference. Experiment outcomes show that the best GF profile significantly outperforms alternative impact indicators and bibliometric networks in F1 score across all 6 computer science fields considered. Moreover, the core GF profiles, with 63.6%-66.5% of the nodes and 12.5%-29.9% of the edges of the full profile, still significantly outperform existing methods in 5 of the 6 fields studied. Visualization of the GF profiles also reveals human-explainable patterns for high-impact scholars.
  PDF · Code & Datasets
- A critical look at the evaluation of GNNs under heterophily: Are we really making progress?
  Oleg Platonov (Yandex Research); Denis Kuznedelev (Yandex); Michael Diskin (Yandex); Artem Babenko (Yandex); Liudmila Prokhorenkova (Yandex)
  Abstract: Node classification is a classical graph machine learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the standard datasets used for evaluating heterophily-specific models have serious drawbacks, making results obtained with them unreliable. The most significant of these drawbacks is the presence of a large number of duplicate nodes in the Squirrel and Chameleon datasets, which leads to train-test data leakage. We show that removing duplicate nodes strongly affects GNN performance on these datasets. Then, we propose a set of heterophilous graphs of varying properties that we believe can serve as a better benchmark for evaluating the performance of GNNs under heterophily. We show that standard GNNs achieve strong results on these heterophilous graphs, almost always outperforming specialized models. Our datasets and the code for reproducing our experiments are available at https://github.com/yandex-research/heterophilous-graphs
  PDF · Code & Datasets
- An Out-of-the-Box Application for Reproducible Graph Collaborative Filtering extending the Elliot Framework
  Daniele Malitesta (Polytechnic University of Bari); Claudio Pomo (Polytechnic University of Bari); Vito Walter Anelli (Polytechnic University of Bari); Tommaso Di Noia (Politecnico di Bari); Antonio Ferrara (Politecnico di Bari)
  Abstract: Graph convolutional networks (GCNs) are taking over collaborative filtering-based recommendation. Their message-passing schema effectively distills the collaborative signal throughout the user-item graph by propagating informative content from neighbor to ego nodes. In this demonstration, we show how to run complete experimental pipelines with six state-of-the-art graph recommendation models in Elliot (i.e., our framework for recommender system evaluation). We seek to highlight three main features, namely: (i) we support reproducibility in PyTorch Geometric (i.e., the library we use to implement the baselines); (ii) the reproduced graph models span various GCN families; and (iii) we prepare a Docker image to provide a self-consistent ecosystem for running the experiments. Code, datasets, and a video tutorial to install and launch the application are accessible at https://github.com/sisinflab/Graph-Demo
  PDF · Code
- Examining the Effects of Degree Distribution and Homophily in Graph Learning Models
  Mustafa Yasir (University of Warwick); John Palowitch (Google); Anton Tsitsulin (Google); Long Tran-Thanh (University of Warwick); Bryan Perozzi (Google Research)
  Abstract: Despite a surge of interest in GNN development, the homogeneity of benchmarking datasets remains a fundamental issue for GNN research. GraphWorld is a recent solution that uses the Stochastic Block Model (SBM) to generate diverse populations of synthetic graphs for benchmarking any GNN task. Despite its success, the SBM imposes fundamental limitations on the kinds of graph structure GraphWorld can create. In this work we examine how two additional synthetic graph generators can improve GraphWorld's evaluation: LFR, a well-established model in the graph clustering literature, and CABAM, a recent adaptation of the Barabasi-Albert model tailored for GNN benchmarking. By integrating these generators, we significantly expand the coverage of graph space within the GraphWorld framework while preserving key graph properties observed in real-world networks. To demonstrate their effectiveness, we generate 300,000 graphs to benchmark 11 GNN models on a node classification task. We find GNN performance variations in response to homophily, degree distribution, and feature signal. Based on these findings, we classify models by their sensitivity to the new generators under these properties. Additionally, we release the extensions made to GraphWorld in the GitHub repository, enabling further evaluation of GNN performance on new graphs.
  PDF · Code & Datasets
  (See the generator sketch after this list.)
- Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit
  Bo Chen (Tsinghua University); Jing Zhang (Renmin University of China); Fanjin Zhang (Tsinghua University); Tianyi Han (Zhipu.AI); Yuqing Cheng (Zhipu.AI); Xinyan Li (Zhipu.AI); Yuxiao Dong (Tsinghua University); Jie Tang (Tsinghua University)
  Abstract: Name disambiguation -- a fundamental problem in online academic systems -- is now facing greater challenges with the increasing growth of research papers. For example, on AMiner, an online academic search platform, about 10% of author names are shared by more than 100 authors. Such challenging real-world cases have not been effectively addressed by existing research due to the small-scale or low-quality datasets used. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of diverse datasets. To this end, we present WhoIsWho, which comprises a large-scale benchmark with over 1,000,000 papers built through an interactive annotation process, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit encapsulating the entire pipeline as well as the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed online in the AMiner system to enable daily arXiv paper assignments. The public leaderboard is available at http://whoiswho.biendata.xyz/. The toolkit is at https://github.com/THUDM/WhoIsWho. The online demo of daily arXiv paper assignments is at https://na-demo.aminer.cn/arxivpaper
  PDF · Code & Datasets
- NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics
  Anwar Said (Vanderbilt University); Roza G Bayrak (Vanderbilt University); Tyler Derr (Vanderbilt University); Mudassir Shabbir (Vanderbilt University); Daniel C Moyer (Vanderbilt University); Catie Chang (Vanderbilt University); Xenofon Koutsoukos (Vanderbilt University)
  Abstract: Machine learning provides a valuable tool for analyzing high-dimensional functional neuroimaging data and is proving effective in predicting various neurological conditions, psychiatric disorders, and cognitive patterns. In functional Magnetic Resonance Imaging (fMRI) research, interactions between brain regions are commonly modeled using graph-based representations. The potency of graph machine learning methods has been established across myriad domains, marking a transformative step in data interpretation and predictive modeling. Yet, despite their promise, the transfer of these techniques to the neuroimaging domain remains surprisingly under-explored due to the expansive preprocessing pipeline and the large parameter search space for graph-based dataset construction. In this paper, we introduce NeuroGraph, a collection of graph-based neuroimaging datasets that span multiple categories of behavioral and cognitive traits. We delve deeply into the dataset generation search space by crafting 35 datasets within both static and dynamic contexts, running in excess of 15 baseline methods for benchmarking. Additionally, we provide generic frameworks for learning on both dynamic and static graphs. Our extensive experiments lead to several key observations. Notably, using correlation vectors as node features, incorporating a larger number of regions of interest, and employing sparser graphs lead to improved performance. To foster further advancements in graph-based, data-driven neuroimaging, we offer a comprehensive open-source Python package that includes the datasets, baseline implementations, model training, and standard evaluation. The package is publicly accessible at https://anwar-said.github.io/anwarsaid/neurograph.html
  PDF · Code & Datasets
- TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs
  Phitchaya Mangpo Phothilimthana (Google Research); Sami A Abu-El-Haija (Google); Kaidi Cao (Stanford University); Bahare Fatemi (University of British Columbia); Charith Mendis (University of Illinois at Urbana-Champaign); Bryan Perozzi (Google Research)
  Abstract: Precise hardware performance models play a crucial role in code optimization. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedups on state-of-the-art models serving substantial production traffic at Google. Although a few datasets for program performance prediction exist, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset of full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with that configuration. The graphs in the dataset are collected from open-source machine learning programs and feature popular model architectures (e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer). TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and its graphs are 770x larger on average than those in existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability and training efficiency to model quality.
  PDF · Code & Datasets
- Graph Generative Model for Benchmarking Graph Neural Networks
  Minji Yoon (Carnegie Mellon University); Yue Wu; John Palowitch (Google); Bryan Perozzi (Google Research); Ruslan Salakhutdinov (Carnegie Mellon University)
  Abstract: As the field of Graph Neural Networks (GNNs) continues to grow, it experiences a corresponding increase in the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the number of benchmark graphs available to researchers, causing the field to rely on only a handful of publicly available datasets. To address this problem, we introduce a novel graph generative model, the Computation Graph Transformer (CGT), that learns and reproduces the distribution of real-world graphs in a privacy-controlled way. More specifically, CGT (1) generates effective benchmark graphs on which GNNs show task performance similar to that on the source graphs, (2) scales to process large graphs, and (3) incorporates off-the-shelf privacy modules to guarantee end-user privacy of the generated graphs. Extensive experiments across a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to benchmark GNN models.
  PDF · Code & Datasets
- A Metadata-Driven Approach to Understand Graph Neural Networks
  Ting Wei Li (University of Michigan); Qiaozhu Mei (University of Michigan); Jiaqi Ma (University of Illinois Urbana-Champaign)
  Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in various applications, but their performance can be sensitive to specific data properties of the graph datasets they operate on. Current literature on understanding the limitations of GNNs has primarily employed a model-driven approach that leverages heuristics and domain knowledge from network science or graph theory to model GNN behaviors, which is time-consuming and highly subjective. In this work, we propose a metadata-driven approach to analyze the sensitivity of GNNs to graph data properties, motivated by the increasing availability of graph learning benchmarks. We perform a multivariate sparse regression analysis on metadata derived from benchmarking GNN performance across diverse datasets, yielding a set of salient data properties. To validate the effectiveness of our data-driven approach, we focus on one identified data property, the degree distribution, and investigate how this property influences GNN performance through theoretical analysis and controlled experiments. Our theoretical findings reveal that datasets with more balanced degree distributions exhibit better linear separability of node representations and thus better GNN performance. We also conduct controlled experiments using synthetic datasets with varying degree distributions, and the results align well with our theoretical findings. Collectively, both the theoretical analysis and the controlled experiments verify that the proposed metadata-driven approach is effective in identifying critical data properties for GNNs.
  PDF · Code & Datasets
  (See the regression sketch after this list.)
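To make the multivariate sparse regression described in the metadata-driven paper above concrete, here is a minimal sketch, not the authors' code: the metadata matrix, the property names, and the regularization strength are all illustrative assumptions. The idea is to regress per-dataset GNN accuracy on dataset-level graph properties with a Lasso, so that only salient properties retain nonzero coefficients.

```python
# Minimal sketch of a metadata-driven sparse regression analysis.
# All data below is synthetic; property names are hypothetical examples.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One row per benchmark dataset, one column per graph property.
properties = ["avg_degree", "degree_gini", "homophily", "clustering_coef"]
X = rng.normal(size=(40, len(properties)))

# Synthetic "GNN accuracy" that truly depends only on degree balance
# and homophily, plus noise.
y = 0.7 - 0.10 * X[:, 1] + 0.08 * X[:, 2] + rng.normal(scale=0.01, size=40)

# Lasso shrinks irrelevant coefficients toward zero, leaving the
# salient data properties.
model = Lasso(alpha=0.01).fit(StandardScaler().fit_transform(X), y)
for name, coef in zip(properties, model.coef_):
    print(f"{name:>15}: {coef:+.3f}")  # near-zero entries drop out
```

On this synthetic data, only degree_gini and homophily keep sizeable coefficients, which is the kind of signal the paper's analysis surfaces from real benchmark metadata.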
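Similarly, for the GraphWorld extension paper above, the sketch below uses off-the-shelf networkx generators rather than the GraphWorld codebase to illustrate the two generator families involved; all parameter values are illustrative assumptions (the LFR values follow the networkx documentation example), and CABAM is omitted because it is not available in networkx.

```python
# Minimal sketch of generating benchmark graphs from two generator
# families: SBM-style block graphs and LFR graphs. Parameters are
# illustrative, not the paper's.
import networkx as nx

# Stochastic Block Model: two communities with dense intra-community
# and sparse inter-community connection probabilities.
sizes = [75, 75]
probs = [[0.10, 0.01],
         [0.01, 0.10]]
sbm = nx.stochastic_block_model(sizes, probs, seed=0)

# LFR benchmark graph: power-law degree and community-size
# distributions; mu controls the fraction of inter-community edges
# (a homophily-like knob).
lfr = nx.LFR_benchmark_graph(
    n=250, tau1=3, tau2=1.5, mu=0.1,
    average_degree=5, min_community=20, seed=10,
)

print("SBM:", sbm.number_of_nodes(), "nodes,", sbm.number_of_edges(), "edges")
print("LFR:", lfr.number_of_nodes(), "nodes,", lfr.number_of_edges(), "edges")
```

Sweeping parameters such as mu or the SBM probability matrix is what lets a GraphWorld-style pipeline populate diverse regions of graph space for benchmarking.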
Organizers
Please contact us through this email address if you have any questions.
A list of organizers can also be found here.
Program Committee
- Ayan Chatterjee (Northeastern University)
- Baixiang Huang (National University of Singapore)
- Brandon A Mayer (Google)
- Chen Shao (Karlsruhe Institute of Technology)
- Chenhui Deng (Cornell University)
- Delvin Ce Zhang (Singapore Management University)
- Dingsu Wang (University of Illinois at Urbana-Champaign)
- Dongjin Song (University of Connecticut)
- Dongkwan Kim (KAIST)
- Jian Kang (University of Rochester)
- Jiarui Lu (Mila)
- Johannes Gasteiger (Technical University of Munich)
- Leonardo F. R. Ribeiro (TU Darmstadt)
- Mehdi Azabou (Georgia Institute of Technology)
- Neil Shah (Snap Inc.)
- Oliver Kiss (Central European University)
- Rik Sarkar (The University of Edinburgh)
- Shuangjia Zheng (Sun Yat-sen University)
- Sungsoo Ahn (POSTECH)
- Wenqing Zheng (University of Texas at Austin)
- Xingjian Zhang (University of Michigan)