
GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking

Jiayan Guo1∗, Lun Du2†, Hengyu Liu3, Mengyu Zhou2, Xinyi He4, Shi Han2
1 School of Intelligence Science and Technology, Peking University; 2 Microsoft; 3 University of Technology Sydney; 4 Xi'an Jiaotong University
[email protected], {lun.du,mezhou,shihan}@microsoft.com, [email protected], [email protected]

arXiv:2305.15066v2 [cs.AI] 11 Jul 2023

Abstract

Large language models (LLMs) like ChatGPT have become indispensable to artificial general intelligence (AGI), demonstrating excellent performance in various natural language processing tasks. Graph data is ubiquitous and an essential part of AGI. The training corpus of large language models often includes some algorithmic components, which allows them to achieve certain effects on some graph data-related problems. However, there is still little research on their performance on a broader range of graph-structured data. In this paper, we conduct an empirical study to assess the proficiency of LLMs in comprehending graph data, employing a diverse range of structural and semantic-related tasks that evaluate the LLMs' capabilities in graph understanding. Through our study, we uncover current limitations and future directions of LLMs in comprehending graphs and performing associated reasoning tasks.

1 Introduction

Large Language Models (LLMs) have demonstrated significant capability across a diverse array of human-centric tasks. These tasks range from answering questions to performing semantic analysis and identifying named entities (Zhao et al., 2023). Despite the considerable strides that have been made, the capacity of LLMs to decipher and manage structured knowledge, especially in the form of graph-structured data, remains an area ripe for exploration. Understanding graph-structured data is vital, given its pervasive presence and integral role in a multitude of applications such as social network analysis, drug discovery, recommender systems, and spatio-temporal prediction.

Understanding graph data is crucial for AGI. Tasks based on graph data can be broadly classified into two categories based on their goals. The first category includes structure understanding tasks like identifying significant nodes, calculating centrality metrics (Okamoto et al., 2008; Zhang and Luo, 2017; Brandes, 2001; Barthelemy, 2004; Newman, 2005), and determining diameters (Chung et al., 1994). The second category encompasses semantic understanding tasks, such as knowledge graph question answering (Huang et al., 2019; Zhang et al., 2018), node classification (Bhagat et al., 2011; Rong et al., 2019), and graph classification (Errica et al., 2019). These tasks have distinct requirements and challenges.

Previous research has investigated the use of LLMs for structural understanding (Sui et al., 2023; Jiang et al., 2023; Gong et al., 2020; Liu et al., 2022), but the emphasis has been predominantly on tables, which rely heavily on structured tabular data. Graphs, on the other hand, introduce additional dimensions of complexity. Comprised of nodes that represent entities or concepts, and edges that express relationships between these entities, graphs necessitate a more sophisticated level of comprehension from LLMs. Understanding graph-structured data with LLMs remains challenging. First, graph data cannot be directly handled by an LLM, as graph data are unorganized and complex. Second, there is a wide range of graph-related tasks; designing efficient input formats and effective prompting techniques for the different tasks is essential yet rarely explored.

In this paper, our goal is to set up a comprehensive comparison to show the ability of LLMs to understand graph-structured data. To achieve this goal, we first bridge the existing gap between LLMs and graph-structured data by proposing a novel framework that integrates the two, intending to enhance their synergistic ability across a wide range of graph mining tasks. Based on the framework, we establish a benchmark across ten common scenarios

∗ Work done during an internship at MSRA.
† Corresponding Author.
to assess language models' capability in handling graph-related tasks. In addition, we experiment with various prompting methods, including both handcrafted and self-generated prompts, to demonstrate their effectiveness in boosting performance in both zero-shot and few-shot settings. Our findings reveal that while LLMs have demonstrated some capability in handling graph-structured data, there remains a substantial need for further development to achieve a performance level comparable to specialized graph-oriented models. In summary, our contributions can be summarized as follows:

• We introduce a new framework that combines Large Language Models (LLMs) and graph-structured data. This setup uses the language understanding skills of LLMs together with graph description languages and prompt engineering to improve how they work together in different situations.

• We develop a wide-ranging set of tasks, across ten common scenarios, to check how well LLMs can handle tasks involving graph data. This set of tasks provides a consistent way to assess how well language models deal with complex graph data.

• Our empirical results show that, while LLMs are getting better at handling graph data, they still have substantial room for improvement before they catch up with models that are specifically designed to work with graphs.

Figure 1: Graph Understanding with LLM Framework. The graph data is first converted to a graph description language (GDL) that can be understood by the LLM. Then the prompt handler combines the user query and the GDL, with potentially multiple rounds, to generate the answer.

2 Preliminary

2.1 Graph Mining Tasks

Graph mining tasks refer to the process of extracting valuable and actionable insights from graph-structured data. Graphs are mathematical structures that represent relationships between entities, where nodes represent entities and edges represent the connections or interactions between them. Graph mining involves analyzing these graphs to discover patterns, relationships, communities, and other useful information. Some graph mining tasks include node classification, link prediction, and community detection. These tasks are crucial in various domains, including social network analysis (Wasserman and Faust, 1994), bioinformatics (Baxevanis et al., 2020), recommendation systems (Isinkaye et al., 2015), fraud detection (Bolton and Hand, 2002), and knowledge graphs (Ji et al., 2021).

2.2 Graph Description Language

A graph description language is a formal language or notation used to define or represent graph-structured data. It provides a standardized syntax and semantics for describing the elements and relationships within a graph. Graph description languages enable the creation, manipulation, and interpretation of graphs in a consistent and machine-readable manner. These languages provide a way to define graph structures, specify node and edge properties, and perform queries and operations on graphs. They are essential for working with graph data and enabling interoperability between graph-based systems and tools. For example, graphs can be represented by an edge list or an adjacency list, providing two distinct perspectives on the graph's structure. An edge list defines a graph in terms of its individual connections, whereas an adjacency list describes each node in terms of its neighboring nodes. Along with these basic representations, more sophisticated formats have been developed to convey richer, contextual information about the graph. For instance, the Graph Modelling Language (GML) (Himsolt, 1997) and the Graph Markup Language (GraphML) (Brandes et al., 2013) provide extensible, language-based frameworks for graph representation.
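As a concrete illustration of the difference between these formats (our own sketch, not code from the paper), the same tiny citation graph can be serialized three ways in plain Python; the node IDs and attribute names echo the paper's running example:

```python
from xml.sax.saxutils import escape

# Toy undirected citation graph (illustrative IDs and titles).
nodes = {"P357": "statistical anomaly detection",
         "P79639": "composite hypothesis testing"}
edges = [("P357", "P79639", "reference")]

# Edge list: the graph described as individual connections.
edge_list = "\n".join(f"{u} {v}" for u, v, _ in edges)

# Adjacency list: each node described by its neighbors.
adj = {n: [] for n in nodes}
for u, v, _ in edges:
    adj[u].append(v)
    adj[v].append(u)
adjacency_list = "\n".join(f"{n}: {' '.join(nbrs)}" for n, nbrs in adj.items())

# GraphML: an XML-based format that also carries node/edge attributes.
lines = ['<graphml xmlns="http://graphml.graphdrawing.org/xmlns">',
         '  <graph edgedefault="undirected">']
for nid, title in nodes.items():
    lines.append(f'    <node id="{nid}"><data key="title">{escape(title)}</data></node>')
for u, v, rel in edges:
    lines.append(f'    <edge source="{u}" target="{v}"><data key="relation">{rel}</data></edge>')
lines += ['  </graph>', '</graphml>']
graphml_text = "\n".join(lines)
```

The edge list and adjacency list are compact but attribute-free, while GraphML can attach the paper titles and citation relations that the semantic tasks below rely on.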
Figure 2: Illustration of Self-prompting. The first request asks the LLM to automatically generate context for the input graph (with or without respect to the question); we may ask the LLM multiple context-related questions. After the new context (such as a context summarization or a format explanation) is generated, it is combined with the original input and sent to the LLM to generate the final output. In the illustrated example, the LLM is given a GraphML citation graph and, after summarizing it, computes the clustering coefficient of node P357 as 1/6 ≈ 0.167.

3 Graph Understanding with LLM Pipeline

The overall pipeline of graph understanding with LLMs is illustrated in Figure 1. For graph data, we first generate their graph description languages (GDLs), and then use a prompt handler to combine the user query and the GDL to form the input to the LLM. The LLM performs reasoning and

generates answers for the user. During the reasoning, the LLM may generate intermediate output that should be handled by the prompt handler to form new input to the LLM. Here we elaborate on the prompt handler to show how to make LLMs better understand graph data.

3.1 Manual Prompting

Manual prompting for graph-based problems involves utilizing familiar graph representations to prompt a large language model (LLM) for desired outputs. The novelty of this approach lies in the fact that it requires a shift from traditional text-based inputs to graphical representations. These graph formats have been discussed in Section 2.2. By employing these graph formats as input, we can provide more comprehensive and context-rich information about the graph to the LLM. Other manual prompting techniques include adding a format explanation to help the LLM better understand the format, and adding role prompting to help the LLM better understand the specific task. Besides, we can also change the input order between the question and the external input, and add examples to utilize the in-context learning ability (Wei et al., 2021) of the LLM. Nonetheless, recently developed chain-of-thought-style promptings (Kojima et al., 2022; Yao et al., 2023) can also be applied to enhance the reasoning ability of the LLM, since many tasks require multiple steps of reasoning (e.g., clustering coefficient computation).

3.2 Self-Prompting

Sometimes the given graph context contains little useful or much redundant information for solving tasks. Thus we need the LLM to perform self-prompting to obtain more context or to eliminate irrelevant information from the given input. It can be challenging for an LLM to generate effective prompts for graph-based tasks, as graphs have complex structures and relationships that need to be accurately captured in the prompt. However, several strategies can be employed for self-prompting in graph-based tasks.

Context Summarization: The LLM can generate a summary of the given graph by extracting key features, such as important nodes, edges, or subgraphs. The generated summary can serve as a prompt for subsequent graph-related questions or tasks. Besides, based on some important elements like nodes and edges, we can use the LLM to summarize their context (neighborhood) information to form neighborhood-aware text features.

Format Explanation: Sometimes it is hard for a human to give an entire description of the input graph format. To give the LLM more context information about the input graph, we can make the LLM generate a format explanation by itself.

By leveraging these self-prompting strategies, the LLM can actively engage in the understanding and manipulation of graphs, facilitating graph-based reasoning and learning.
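Concretely, such a prompt handler can be sketched as a small assembly function. The helper below is our own illustration (the names are hypothetical, not from the paper's code); it combines the ingredients discussed above: role prompting, format explanation, a 1-shot example, and the question/graph input order.

```python
from typing import Optional

def build_prompt(gdl: str, question: str,
                 role: Optional[str] = None,
                 format_explanation: Optional[str] = None,
                 example: Optional[str] = None,
                 question_first: bool = True) -> str:
    """Assemble a graph-understanding prompt from optional components.

    Illustrative sketch: the paper ablates role prompting, format
    explanation, input order, and 1-shot examples as separate pieces.
    """
    parts = []
    if role:                    # role prompting, e.g. "You are a graph master..."
        parts.append(role)
    if format_explanation:      # explain the GDL syntax to the model
        parts.append(format_explanation)
    if example:                 # 1-shot demonstration for in-context learning
        parts.append(example)
    if question_first:          # place external knowledge before the graph
        parts += [f"Question: {question}", f"Graph:\n{gdl}"]
    else:
        parts += [f"Graph:\n{gdl}", f"Question: {question}"]
    return "\n\n".join(parts)

prompt = build_prompt("A -- B\nB -- C", "What is the degree of node B?",
                      role="You are a graph analysis expert.")
```

Putting the question before the graph here reflects the input-order variation studied in the experiments below.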
4 Graph Understanding Benchmark

4.1 Structure Understanding Tasks

Graph Size Detection. This task evaluates a large language model's (LLM) capability to discern the size of a provided graph. In this context, size refers to the count of nodes and edges present in the graph. The LLM is expected to accurately determine these metrics, even when user-provided designs and accompanying data, such as descriptions, statements, or queries, augment the graph. Despite the inherent challenge this poses for language models, a precise count of nodes and edges is critical, as it enables the LLM to contextualize information accordingly.

Degree Detection. This task investigates the LLM's aptitude for understanding a node's contextual relevance within a graph. Here, the degree of a node, an indicator of a node's importance and the sparsity of its connections, forms the crux of the task. The LLM must ascertain the number of neighbors for a given node, based on the graph text and any supplementary information. The degree of a node is foundational for various centrality measures such as degree centrality and the clustering coefficient, underscoring the task's importance in understanding a node's local structure.

Edge Detection. Building on degree detection, this task further explores the LLM's understanding of a node's local structure. The model must identify the neighboring nodes of a given node, a skill that is vital for complex graph mining activities like calculating distances and discerning connectivity patterns. Mastery of this task signifies the LLM's comprehension of the fundamental aspects necessary for advanced graph analysis.

Attribute Retrieval. This task tests the LLM's capacity to retrieve pertinent details about a node, such as the node's attributes, which play a key role in defining its characteristics. For instance, the LLM might need to retrieve a specific attribute such as a paper's title or an author's gender. Success in this task highlights the LLM's ability to comprehend and retrieve essential node-related information.

Diameter Computing. This task challenges the LLM to calculate the diameter of a graph. The diameter, which is the longest shortest path between any two nodes, offers valuable insights into the graph's overall connectivity and reachability. A successful computation of the diameter showcases the LLM's grasp of the graph's structure and its ability to analyze the graph's overarching characteristics.

Clustering Coefficient Computing. In this task, the LLM needs to compute the clustering coefficient of a graph, a measure that indicates how closely nodes in a graph tend to cluster together. The task thereby provides a means to assess the LLM's understanding of local connectivity patterns and its ability to evaluate the degree of clustering within a graph. Besides, it tests the reasoning ability of the LLM, since computing the clustering coefficient involves several steps.

Figure 3: Structure Understanding Tasks
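For reference, the ground-truth answers to these structural tasks can be computed directly. Below is a minimal plain-Python sketch of our own (the graph and names are illustrative); the toy graph is chosen so that node A has four neighbors of which exactly one pair is connected.

```python
from itertools import combinations
from collections import deque

# Toy undirected graph as an adjacency dict (illustrative).
adj = {"A": {"B", "C", "D", "E"}, "B": {"A", "C"},
       "C": {"A", "B"}, "D": {"A"}, "E": {"A"}}

# Graph size detection: counts of nodes and edges.
num_nodes = len(adj)
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2

# Degree detection: number of neighbors per node.
degree = {n: len(nbrs) for n, nbrs in adj.items()}

def local_clustering(node):
    """Closed neighbor pairs divided by all possible neighbor pairs."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    closed = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / (k * (k - 1) / 2)

def eccentricity(src):
    """Longest BFS shortest-path distance from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

# Diameter computing: the longest shortest path between any two nodes.
diameter = max(eccentricity(n) for n in adj)
```

On this toy graph, node A's four neighbors form six possible pairs, of which only one is connected, so its local clustering coefficient is 1/6 ≈ 0.167, the same ratio as in the worked example of Figure 1.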
4.2 Semantic Understanding Tasks

Knowledge Graph Question Answering. This task gauges the LLM's proficiency in answering questions that pertain to a knowledge graph. Knowledge graphs organize data into a structured format, embodying entities, attributes, and relationships. Task success hinges on the LLM's ability to reason about and understand the underlying graph structure to provide accurate answers, thus demonstrating its semantic understanding and capability to navigate and extract information from a KG.

Graph Query Language Generation. This task measures the LLM's capability to generate graph query languages that satisfy user requirements. These languages, including GQL and Cypher, allow users to extract specific information from a graph database. By generating appropriate graph queries in response to user information needs, the LLM showcases its comprehension of user intent and precision in query formulation.

Node Classification. This task requires the LLM to classify nodes within a graph based on their attributes or structural features. The LLM is given labeled node examples and their associated classes, and it must correctly predict the class of unseen nodes by applying learned patterns from the labeled data. Success in node classification showcases the LLM's ability to generalize from examples and apply its understanding of node attributes and structure to classify new nodes accurately.

Graph Classification. This task extends the scope of node classification to encompass entire graphs. The LLM is given graphs, each labeled with specific categories or classes, and is expected to accurately classify unseen graphs by using patterns learned from the labeled examples. This task evaluates the LLM's ability to understand and apply the structural and attribute-based characteristics of a graph holistically, thus enabling accurate classification of new graphs.

Figure 4: Semantic Understanding Tasks

5 Data Collection

5.1 Structure Understanding Task

To demonstrate the capabilities of language models in reasoning over structure understanding tasks, we selected two well-known citation networks: ogbn-arxiv (Hu et al., 2020) and Aminer (Tang et al., 2008). Our approach involved randomly sampling 100 initial seed nodes from each graph and applying a Depth-First Search (DFS) algorithm to sample 2-hop subgraphs centered around these nodes. Each subgraph consisted of approximately 10-20 nodes and 40 edges. To evaluate the performance of the language model, we assigned it the following tasks within these subgraphs: degree detection, attribute retrieval, clustering, size detection, and diameter estimation. For the first three tasks, the model provided results for each individual node in the subgraphs; for size detection and diameter estimation, we computed the results for each entire subgraph. Another task we tackled was edge detection. Here, we treated each edge in the graph as a positive sample and randomly selected an edge not present in the graph as a negative sample. We then asked the language model to determine whether a given edge belonged to the subgraph or not, based on the information provided by the subgraph.

5.2 Semantic Understanding Task

Shifting our focus to semantic understanding tasks, we conducted knowledge graph question answering using two widely-used datasets: Wiki, a temporal knowledge graph, and MetaQA, a multi-hop movie knowledge base. These datasets served as a testing ground to evaluate the performance of the language model in these domains. For node classification, we leveraged the original labels available in the ogbn-arxiv dataset. We randomly sampled 100 nodes from the test set and tasked the language model with predicting their labels based on information such as the node's title, abstract, and the text information from its k-hop neighbors. In addition, we explored graph query language generation using the MetaQA dataset. We constructed a graph database from this dataset and prompted the language model to generate corresponding graph query language (GQL) statements such as Cypher. The generated GQL statements were then executed using the Neo4j engine. Through these experiments, we aim to assess the language model's performance in various tasks related to structural and semantic understanding of graph-structured data.

6 Experiments

6.1 Experimental Setup

Downstream Task.
Models. We evaluate the performance of the recent dominant LLM, InstructGPT-3 (Ouyang et al., 2022), using versions text-davinci-001, text-davinci-002, and text-davinci-003. Unless otherwise specified, we utilize text-davinci-003 in all experiments. The temperature is set to 0.3 to control the variety of the output.
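The sampling procedure of Section 5.1, a depth-limited DFS around a seed node plus positive/negative edge pairs for edge detection, can be sketched as follows. This is our own illustration in plain Python, not the authors' released code; the helper names and toy graph are ours.

```python
import random

def k_hop_subgraph(adj, seed, k=2):
    """Collect all nodes within k hops of seed via depth-limited DFS."""
    visited = {seed}
    stack = [(seed, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == k:
            continue
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                stack.append((nbr, depth + 1))
    return visited

def edge_detection_samples(adj, nodes, rng):
    """Return one positive (existing) and one negative (absent) edge."""
    ordered = sorted(nodes)
    node_set = set(ordered)
    pos = next((u, v) for u in ordered for v in sorted(adj[u])
               if v in node_set and u < v)
    while True:  # rejection-sample a non-edge inside the subgraph
        u, v = rng.sample(ordered, 2)
        if v not in adj[u]:
            return pos, (u, v)

# Toy 5-node graph (illustrative); node E is 3 hops from seed A.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"},
       "D": {"B", "E"}, "E": {"D"}}
sub = k_hop_subgraph(adj, "A", k=2)
```

Here the 2-hop subgraph around seed A contains A, B, C, and D but excludes E, and each positive edge sample is paired with a randomly drawn non-edge from the same node set.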
Table 1: Experiments on Graph Structural Understanding on OGBN-ARXIV. ACC indicates average accuracy over samples, while ∆ indicates the difference of each variant from the 1-shot setting. - denotes that the input format does not contain the corresponding information.

Size Detection Degree Detection Edge Detection Attribute Retrieval Diameter Clustering
Format Input Design
ACC ∆ ACC ∆ ACC ∆ ACC ∆ ACC ∆ ACC ∆
1-shot 35.50 0.00 15.21 0.00 65.45 0.00 - - 28.00 0.00 5.42 0.00
1-shot-cot 44.00 +8.50 14.58 -0.63 65.25 -0.20 - - 24.00 -4.00 1.85 -3.57
Adjacency w/o format explanation 33.00 -0.25 16.34 +1.13 57.50 -8.25 - - 18.00 -10.00 5.19 +3.43
List w/o role prompting 36.60 +1.10 15.70 +0.49 55.00 -10.45 - - 20.00 -8.00 4.71 -0.23
w/o change order 14.00 -21.50 26.28 +11.07 51.20 -14.25 - - 30.00 +2.00 14.92 -9.50
w/o 1-shot 33.00 -2.50 17.18 +1.97 71.90 -6.45 - - 22.00 -6.00 7.85 +2.43
1-shot 22.50 0.00 44.87 0.00 74.60 0.00 - - 43.00 0.00 13.31 0.00
1-shot-cot 27.00 +4.50 48.65 +3.78 74.70 +0.10 - - 41.00 -2.00 11.33 -1.98
Edge List w/o format explanation 25.00 +2.50 47.86 +2.99 71.55 -3.05 - - 36.00 -7.00 18.11 +4.80
w/o role prompting 18.00 -4.50 47.64 +2.57 71.70 -2.90 - - 39.00 -4.00 13.63 +0.35
w/o change order 9.00 -13.50 20.48 -23.39 79.60 +5.00 - - 10.00 -33.00 20.06 + 7.05
w/o 1-shot 23.00 +0.50 49.34 +4.47 80.95 +6.35 - - 34.00 -9.00 19.16 +5.84
1-shot 54.50 0.00 20.91 0.00 50.45 0.00 83.40 0.00 37.00 0.00 4.36 0.00
1-shot-cot 55.50 +1.00 20.76 -0.15 50.10 -0.35 83.30 -0.10 28.00 -9.00 0.95 -3.41
GML w/o format explanation 55.00 -0.50 29.06 +8.15 50.00 -0.45 85.97 +2.57 41.00 +4.00 12.71 +8.35
w/o role prompting 54.50 -0.50 29.79 +8.88 50.00 -0.45 84.50 +0.10 35.00 -2.00 6.96 +2.60
w/o change order 51.50 -3.00 21.16 +0.24 55.65 +5.20 83.56 +0.16 39.00 +2.00 5.25 +0.89
w/o 1-shot 54.00 -0.50 19.85 -1.06 50.25 +0.20 83.22 -0.18 42.00 +5.00 5.39 +1.03
1-shot 25.00 0.00 40.20 0.00 62.05 0.00 83.87 0.00 34.00 0.00 9.74 0.00
1-shot-cot 22.50 -2.50 40.02 -0.18 62.30 +0.25 83.75 -0.12 32.00 -2.00 7.29 -2.45
GraphML w/o format explanation 19.00 -6.00 46.90 +5.88 53.75 -8.40 85.37 +1.50 38.00 +4.00 22.75 +13.01
w/o role prompting 15.50 -9.50 49.89 +9.87 56.10 -5.95 87.63 +3.76 31.00 -3.00 14.52 +4.78
w/o change order 8.50 -16.50 30.60 -9.60 65.35 +3.30 9.76 -4.11 43.00 +9.00 8.00 -1.74
0-shot 24.50 -0.50 39.59 -0.61 73.95 +11.90 82.90 -0.97 30.00 -4.00 14.32 +4.58

6.2 Results for Structure Understanding Task

The results for the structure understanding tasks are presented in Table 1, revealing several significant findings:

Input Design Has a Significant Impact on the Final Result. Our experiments demonstrate that the design of the input plays a crucial role in determining the performance of the model. By carefully considering the arrangement and organization of the input data, we can substantially influence the model's ability to understand the structural aspects of the task at hand. Fine-tuning the input design can lead to improved performance and more accurate structural understanding.

Role Prompting Generally Improves Performance. Our findings indicate that incorporating role-prompting techniques generally enhances the model's performance in the structure understanding tasks. By explicitly guiding the model to focus on specific roles or relationships within the graph, we enable it to extract more meaningful insights and make more accurate predictions. Role prompting serves as an effective mechanism for capturing the nuances of the graph's structure and leveraging that information for improved understanding.

Examples Have Impacts on Graph Understanding. Similar to previous research suggesting the utility of examples in large language models (LLMs), we discovered that examples also have some positive effect in graph understanding scenarios. However, omitting specific examples and relying on zero-shot learning approaches sometimes yielded stronger results. This phenomenon can be attributed to the rich inherent information present within the graph itself, which allows the model to grasp the complexities of the structure without the need for explicit examples. Examples, in some cases, can introduce noise, biases, or incomplete information, hindering the model's overall understanding.

The Position of External Knowledge Matters. We investigated the impact of external knowledge, such as questions, statements, and examples, on graph understanding. Comparing the placement of external knowledge before or after the graph input, we observed that positioning external knowledge before the graph generally led to better performance. Placing external knowledge before the graph provides additional context information, enabling the model to better comprehend the specific graph it needs to handle. Conversely, positioning the graph behind external knowledge may hinder the model's ability to effectively utilize the relevant information, potentially degrading performance.

These findings show the importance of thoughtful input design, the potential benefits of role prompting techniques, the limited impact of examples in graph understanding, and the significance of positioning external knowledge for optimal performance. Understanding these factors can guide future research and inform the development of more effective models for structure understanding tasks.

6.3 Results for Semantic Understanding Task

The results for the semantic understanding tasks are shown in Table 2, Table 3, and Table 4. We have the following discoveries:
Table 2: Performance on KGQA and GQL Generation

Method Wiki MetaQA-1hop MetaQA-2hop MetaQA-3hop


SOTA 64.70 97.50 98.80 94.80
zero-shot 9.23 24.75 6.37 9.72
zero-shot-cot 8.71 18.41 12.86 21.89
zero-shot+graph 56.38 91.69 46.82 19.40
zero-shot-cot+graph 55.63 86.16 47.36 19.29
zero-shot+graph+change-order 51.35 95.20 40.48 20.17
zero-shot-cot+graph+change-order 56.33 95.87 47.71 23.95
zero-shot Cypher Generation - 30.00 10.00 13.00
one-shot Cypher Generation - 99.00 77.00 96.00

Results for KGQA and GQL Generation. The results for KGQA and GQL generation are shown in Table 2. It is noticeable that current SOTA models consistently show higher performance across all datasets, with scores ranging from 94.80 on MetaQA-3hop to 98.80 on MetaQA-2hop. However, the LLM showed comparable performance on certain tasks with suitable prompting strategies. Specifically, the 'zero-shot+graph' method performed exceptionally well on the Wiki dataset, achieving an accuracy of 56.38, the highest among our proposed variants. Similarly, the 'zero-shot-cot+graph+change-order' model performs the best on MetaQA-1hop, scoring 95.87. When we compare zero-shot models with their 'zero-shot-cot' counterparts, we observe a general trend that the inclusion of the graph ('+graph') and the change-order ('+change-order') enhancements improve model performance. For the 'one-shot Cypher' method, an impressive performance of 99.00 is achieved on MetaQA-1hop, surpassing the state-of-the-art and all other models in our study.

Table 3: Performance of Node Classification on OGBN-ARXIV. self denotes only the use of the text feature of the target nodes. 1-hop denotes using the text feature of direct neighbors. 2-hop denotes using the text feature within 2-hop neighbors.

Method        Context  ACC
zero-shot     self     48.00
zero-shot     1-hop    53.00
zero-shot     2-hop    57.00
zero-shot-cot self     40.00
zero-shot-cot 1-hop    40.00
zero-shot-cot 2-hop    56.00
one-shot      self     50.00
one-shot      1-hop    54.00
one-shot      2-hop    60.00
one-shot-cot  self     43.00
one-shot-cot  1-hop    55.00
one-shot-cot  2-hop    59.00

Results for Node Classification. For node classification on OGBN-ARXIV (Table 3), the 'one-shot + 2-hop neighborhood context summarization' variant has the highest accuracy of 60.00 among all the variants. Interestingly, models augmented with 2-hop neighborhood context summarization ('2-hop') show better performance than their 1-hop counterparts, showing that expanding the context range helps provide valuable information. Also, the model performs better than its chain-of-thought (cot) counterpart, suggesting that the cot strategy might not be as effective for this task. These results indicate potential areas for improvement, particularly for the 'zero-shot-cot' and 'change-order' strategies, which do not consistently improve performance. Nonetheless, the experiments provide valuable insights into the performance of different strategies in the node classification task.

Results for Graph Classification. The results for the graph classification task are shown in Table 4. From the results, we find that self-augmentation is effective in improving the performance of graph classification. It shows that self-augmentation techniques like self-format explanation and self-summarization can enrich the context of the original graph and make it easier for the LLM to complete the task.

Table 4: Performance on Graph Classification

Dataset                       OGBG-MOLHIV      OGBG-MOLPCBA
                              GML    GraphML   GML    GraphML
1-shot-tot                    66.87  63.25     57.18  57.45
1-shot-cot                    67.65  64.71     59.26  57.32
w/o self-format explanation   64.71  64.71     58.73  56.24
w/o self-summarization        61.76  61.77     57.64  56.67
0-shot-cot                    58.82  59.76     55.57  55.32

7 Discussion

Our findings suggest several promising directions for future work in structure understanding tasks with LLMs. First, more research is needed to understand how different input designs and role prompting techniques can further enhance performance. Second, we encourage researchers to investigate why examples are less effective for graph understanding and to explore alternative strategies for leveraging the rich information embedded in graphs. Third, the role of external knowledge placement merits further exploration. Finally, new approaches for graph augmentation could be developed to improve performance on semantic understanding tasks.

In addition, our experiments have revealed the potential of LLMs in various tasks beyond pure natural language processing. We believe that more effort should be dedicated to integrating graph-based information into LLMs, exploring different types of graph structures, and applying LLMs to other areas such as graph theory, network science, and complex systems. In the future, we may also consider using LLMs to control the use of external tools to better handle graph-structured data (Schick et al., 2023; Zhang, 2023).

8 Related Works

8.1 Language Model for Structural Data Understanding

Language models are being extended to understand and work with structural data, such as graphs, tables, […] with textual information, leveraging both textual and structural data to make informed predictions.

8.2 Graph Machine Learning

Graph machine learning develops models and algorithms for data structured as graphs, representing complex relationships in various domains. Traditional machine learning struggles with graph-structured data, but graph machine learning methods utilize the graph structure to extract meaningful features and make predictions. Graph convolutional networks (GCN) extend convolutional neural networks to operate on graph-structured data, capturing local and global structural patterns and excelling in tasks like node classification and graph-level classification (Kipf and Welling, 2016). Graph attention networks (GAT) incorporate attention mechanisms, allowing adaptive aggregation of information from relevant nodes (Velickovic et al., 2017). GATs perform well in tasks like node classification and graph-level representation learning. Graph generative models generate new graphs to capture the structural characteristics and properties of the input data, benefiting tasks like molecule generation (Walters and Barzilay, 2020) and graph-
based data augmentation (Zhao et al., 2021). Graph
bles, and trees. One approach is using graph neural
machine learning techniques enable effective analy-
networks to encode structural information, captur-
sis and extraction of insights from graph-structured
ing dependencies and relationships between ele-
data, advancing fields relying on understanding
ments (Qasim et al., 2019). Incorporating GNNs
complex relationships and dependencies.
into language models enables them to generate con-
textually aware outputs that consider the structural
9 Conclusion
characteristics of the data. Another approach is
incorporating attention mechanisms into language In this work, we analyze the ability of large lan-
models for structural data (Chen et al., 2022; Eisen- guage models to understand graph-structured data.
schlos et al., 2021). Attention allows the model to Our findings indicate that there is still a long way
focus on relevant parts, improving understanding for a LLM to understand graph data. Future re-
of complex dependencies and enhancing perfor- search should focus on developing and refining
mance in tasks like graph completion and table methods for encoding graph-structured informa-
understanding. Language models can also bene- tion into a format that a large language model can
fit from combining knowledge graph embeddings comprehend and manipulate effectively. This is a
complex challenge given the inherent differences Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong,
between sequential text data and graph data, which Hongyu Ren, Bowen Liu, Michele Catasta, and Jure
Leskovec. 2020. Open graph benchmark: Datasets
is intrinsically multi-dimensional and relational.
for machine learning on graphs. Advances in neural
information processing systems, 33:22118–22133.
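The conclusion above notes that encoding graph structure into text an LLM can consume, and then checking its answers, is the central difficulty. As a minimal, self-contained sketch of what such an evaluation harness involves (the function names and the edge-list verbalization are illustrative, not the paper's actual pipeline), the following serializes a toy graph into a prompt and computes the ground-truth structural answers used to score the model's reply:

```python
from collections import defaultdict

def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def verbalize(edges):
    """Turn an edge list into the kind of '<graph text>' used in prompts."""
    return "Graph with edges: " + ", ".join(f"({u}, {v})" for u, v in edges)

def degree(adj, node):
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = build_adj(edges)
prompt = verbalize(edges) + " What is the degree of node 2?"
# Ground truth used to score the LLM's answer:
# node 2 touches edges (0,2), (1,2), (2,3), so its degree is 3;
# its neighbors are {0,1,3}, of which only the pair (0,1) is connected,
# so its clustering coefficient is 1/3.
```

An LLM's free-text answer to `prompt` can then be compared against `degree(adj, 2)` exactly as the structure-understanding tasks in this paper are scored.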

References

Marc Barthelemy. 2004. Betweenness centrality in large complex networks. The European Physical Journal B, 38(2):163–168.

Andreas D Baxevanis, Gary D Bader, and David S Wishart. 2020. Bioinformatics. John Wiley & Sons.

Smriti Bhagat, Graham Cormode, and S Muthukrishnan. 2011. Node classification in social networks. arXiv preprint arXiv:1101.3291.

Richard J Bolton and David J Hand. 2002. Statistical fraud detection: A review. Statistical Science, 17(3):235–255.

Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177.

Ulrik Brandes, Markus Eiglsperger, Jürgen Lerner, and Christian Pich. 2013. Graph markup language (GraphML).

Miao Chen, Xinjiang Lu, Tong Xu, Yanyan Li, Zhou Jingbo, Dejing Dou, and Hui Xiong. 2022. Towards table-to-text generation with pretrained language model: A table structure understanding and text deliberating approach. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8199–8210, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Fan RK Chung, Vance Faber, and Thomas A Manteuffel. 1994. An upper bound on the diameter of a graph from eigenvalues associated with its Laplacian. SIAM Journal on Discrete Mathematics, 7(3):443–457.

Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, and William W Cohen. 2021. MATE: Multi-view attention for table transformer efficiency. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. 2019. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893.

Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020. TableGPT: Few-shot table-to-text generation with table structure reconstruction and content matching. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1978–1988.

Michael Himsolt. 1997. GML: Graph modelling language. University of Passau.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133.

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 105–113.

Folasade Olubusola Isinkaye, Yetunde O Folajimi, and Bolande Adefowoke Ojokoh. 2015. Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3):261–273.

Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and S Yu Philip. 2021. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514.

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645.

Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Ao Liu, Haoyu Dong, Naoaki Okazaki, Shi Han, and Dongmei Zhang. 2022. PLOG: Table-to-logic pretraining for logical table-to-text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5531–5546, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Mark EJ Newman. 2005. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54.

Kazuya Okamoto, Wei Chen, and Xiang-Yang Li. 2008. Ranking of closeness centrality for large-scale social networks. Lecture Notes in Computer Science, 5059:186–195.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait. 2019. Rethinking table recognition using graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 142–147. IEEE.

Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2019. DropEdge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2023. Evaluating and enhancing structural understanding capabilities of large language models on tables via input designs. arXiv preprint arXiv:2305.13062.

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 990–998.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, et al. 2017. Graph attention networks. stat, 1050(20):10–48550.

W Patrick Walters and Regina Barzilay. 2020. Applications of deep learning in molecule generation and molecular property prediction. Accounts of Chemical Research, 54(2):263–270.

Stanley Wasserman and Katherine Faust. 1994. Social network analysis: Methods and applications.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.

Jiawei Zhang. 2023. Graph-ToolFormer: To empower LLMs with graph reasoning ability via prompt augmented by ChatGPT. arXiv preprint arXiv:2304.11116.

Junlong Zhang and Yu Luo. 2017. Degree centrality, betweenness centrality, and closeness centrality in social network. In 2017 2nd International Conference on Modelling, Simulation and Applied Mathematics (MSAM2017), pages 300–303. Atlantis Press.

Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander Smola, and Le Song. 2018. Variational reasoning for question answering with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Tong Zhao, Yozen Liu, Leonardo Neves, Oliver Woodford, Meng Jiang, and Neil Shah. 2021. Data augmentation for graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11015–11023.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.

A Detailed Description of Datasets

A.1 OGBN-ARXIV

The Open Graph Benchmark (OGB) is a collection of diverse, large-scale, and challenging datasets and benchmarking tasks for graph machine learning research. OGBN-ARXIV is part of the OGB Node Property Prediction track. The dataset comprises academic papers from the arXiv website, represented as nodes in a citation graph, where the edges denote citation relationships between papers. Each paper is associated with a 128-dimensional word2vec feature vector derived from its title and abstract. The task associated with this dataset is to predict the subject area of each paper, making it a multi-class classification problem. We sample a subset of 100 nodes with multi-hop neighbors for testing.

A.2 OGBG-MOLX

OGBG-MOLX is part of the Graph Property Prediction track in OGB and comprises two datasets: MOLHIV and MOLPCBA. The MOLHIV dataset contains molecular graphs where the task is to predict whether a molecule inhibits HIV virus replication, making it a binary classification problem. The MOLPCBA dataset contains molecular graphs with the task of predicting bioactivity for various protein targets, which is a multi-label classification problem. In these datasets, nodes represent atoms and edges represent bonds between atoms. Node and edge features include atom type, atom degree, bond type, and whether the bond is in a ring. We sample 100 graphs with the same number of positive and negative samples for testing.

A.3 Wiki

The Wiki dataset is a well-known dataset that contains text from a collection of Wikipedia articles. The structure of this dataset varies depending on the particular task. For instance, for text classification tasks, each document (or article) can be represented as a bag-of-words vector, with each dimension representing the frequency of a specific word; the labels may include the categories that the articles belong to. In the context of graph-based tasks, a Wikipedia dataset could consist of a network of articles (as nodes) linked by hyperlinks (as edges), with the task being to predict article categories or links between articles.

A.4 MetaQA

MetaQA is a dataset designed for the task of multi-hop reasoning in question answering. It consists of movie-related knowledge graph entities, relations, and natural language questions. Each node in the knowledge graph represents a movie entity (such as a movie, actor, or director), and edges represent relationships between entities. The dataset includes questions at three levels of complexity (1-hop, 2-hop, and 3-hop), with each level requiring reasoning over an increasing number of edges in the graph to answer the questions correctly. The goal is to answer natural language questions by effectively reasoning over the underlying knowledge graph.

B Input Design for Different Tasks

The input designs for the different tasks are shown in Table 5, which gives the question template used for each task.

Table 5: Input Design for Different Tasks.

Task                   Description
Size Detection         <graph text> What is the number of nodes and edges in this graph? Please answer with the number of nodes: X, number of edges: X.
Degree Detection       <graph text> What is the degree of node X?
Edge Detection         <graph text> Is there an edge between node X1 and node X2?
Attribute Retrieval    What is the title of node X?
Diameter               What is the diameter of this graph?
Clustering             What is the clustering coefficient of node X?
KGQA                   Knowledge: <graph text>, Question: <question text>
GQL Generation         Thus the Neo4j CQL of the question is
Node Classification    Which arxiv CS subcategory does paper <paper title> with abstract <paper abstract> belong to? Use the abbreviation to answer.
Graph Classification   <graph text> Whether the molecule inhibits HIV virus replication? Yes or no.

C Cypher Introduction

Cypher is a declarative graph query language developed by Neo4j, a popular graph database management system. It allows for expressive and efficient querying and updating of graph data. The language is designed to be intuitive and readable, drawing on English prose and iconography. Cypher is built around the concept of pattern matching: it focuses on clearly expressing what to retrieve from a graph, not dictating how to retrieve it. This design makes Cypher powerful when working with graph data, as patterns are often more intuitive and easier to understand.

D Data and Code Release

The GUC benchmark and code in this paper will be open-sourced at https://anonymous.4open.science/r/GPT4Graph after an internal review. The synthesized labels in the benchmark will be released under the CDLA-Permissive-2.0 license. Our code will be released publicly under the MIT license.
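Table 4 reports results for graphs described in both GML (Himsolt, 1997) and GraphML (Brandes et al., 2013). For readers unfamiliar with the two formats, the snippet below shows the same two-node, one-edge graph in each. The node ids and labels are illustrative, and these are minimal sketches of the published formats rather than the exact serializations used in the experiments; since GraphML is plain XML, its structure can also be checked mechanically with the standard library:

```python
import xml.etree.ElementTree as ET

# A toy graph in GML: nested key-value blocks.
gml = """graph [
  directed 0
  node [ id 0 label "paper_0" ]
  node [ id 1 label "paper_1" ]
  edge [ source 0 target 1 ]
]"""

# The same graph in GraphML: XML with a fixed namespace.
graphml = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""

# GraphML can be validated as ordinary XML before being put in a prompt.
root = ET.fromstring(graphml)
ns = "{http://graphml.graphdrawing.org/xmlns}"
nodes = root.findall(f"{ns}graph/{ns}node")
edges = root.findall(f"{ns}graph/{ns}edge")
print(len(nodes), len(edges))  # prints "2 1"
```

Either string can be pasted into the `<graph text>` slot of the prompt templates in Table 5; the paper's experiments compare how well the LLM copes with each description language.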

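As an illustration of the pattern-matching style described in the Cypher introduction (Section C) and targeted by the "GQL Generation" task in Table 5, a 1-hop MetaQA-style question such as "Who acted in the movie Inception?" could be expressed as the query below. The node labels, relationship type, and movie title here are hypothetical, not taken from the benchmark's actual schema:

```cypher
// Assumed schema: (:Person)-[:ACTED_IN]->(:Movie {title})
MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: "Inception"})
RETURN p.name
```

The MATCH clause draws the graph pattern to find, and RETURN names what to extract, which is exactly the "what, not how" style the section describes.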