Abstract
Knowledge Graph-augmented Retrieval-Augmented Generation (KG-RAG) systems have emerged as a powerful approach for question answering over structured knowledge. The conventional wisdom suggests that Graph Neural Networks (GNNs) should excel at ranking paths in knowledge graphs due to their ability to aggregate neighborhood information.
However, our research demonstrates a counterintuitive finding: in dense academic citation graphs with high entity overlap, simple MLP-based scorers outperform complex GNN architectures while using 13× fewer parameters.
This paper was accepted to ICMLC 2026 and will be presented in February 2026.
The Research Question
Knowledge graph RAG systems typically follow a two-stage process:
- Path Discovery: Find potential paths in the KG connecting query entities to candidate answers
- Path Ranking: Score and rank these paths by relevance
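The two stages above can be sketched in a few lines. The BFS-based discovery and the stand-in scorer below are illustrative only, not the paper's implementation:

```python
from collections import deque

def discover_paths(graph, source, target, max_len=3):
    """Stage 1: enumerate simple paths from source to target (<= max_len edges)."""
    paths, queue = [], deque([[source]])
    while queue:
        path = queue.popleft()
        if path[-1] == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no revisits)
                queue.append(path + [nxt])
    return paths

def rank_paths(paths, score_path):
    """Stage 2: score each discovered path and sort best-first."""
    return sorted(paths, key=score_path, reverse=True)

# Toy citation graph; a stand-in scorer that prefers shorter paths.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
paths = discover_paths(graph, "A", "D")
ranked = rank_paths(paths, score_path=lambda p: 1.0 / len(p))
```

The question the paper asks is what should play the role of `score_path` in stage 2.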
For path ranking, GNNs such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) are the popular choice: they are designed to exploit graph structure by aggregating information from neighboring nodes.
But what if the graph is too dense? What if nearly every entity connects to nearly every other entity through short paths? Does graph structure still help, or does it become noise?
Experimental Setup
Dataset: Dense Academic Citation Graph
I constructed a knowledge graph from academic papers in machine learning and NLP:
- 3,215 papers spanning 2010-2023
- 85.8% entity overlap (most papers cite many of the same foundational works)
- 784 test queries covering various types of paper relationships
The key characteristic: this graph is dense. Most papers are connected through citation paths of length 2 or 3. This density is representative of real-world academic KGs where foundational papers create hub nodes.
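The post does not spell out how density was measured; as one plausible proxy (an assumption on my part, not the paper's metric), you can compute the fraction of node pairs connected within k hops:

```python
from itertools import combinations

def frac_pairs_within(graph, k):
    """Fraction of node pairs joined by a path of at most k edges,
    treating citation edges as undirected. A rough density proxy."""
    adj = {}
    for u, nbrs in graph.items():
        for v in nbrs:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    def within(u, v):
        seen = {u}
        frontier = {u}
        for _ in range(k):
            frontier = {w for x in frontier for w in adj[x]} - seen
            seen |= frontier
        return v in seen
    pairs = list(combinations(sorted(adj), 2))
    return sum(within(u, v) for u, v in pairs) / len(pairs)

# A hub-dominated toy graph: every pair is reachable within 2 hops.
star = {"hub": ["a", "b", "c", "d"]}
```

On a hub-dominated graph like `star`, this proxy saturates at 1.0 already at k=2, mirroring the 2-3 hop connectivity described above.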
Models Compared
I compared three path ranking approaches:
- MLP Scorer: Simple feedforward network operating on path features (length, relation types, node features)
- GCN (Graph Convolutional Network): Multi-layer GCN aggregating neighborhood information
- GAT (Graph Attention Network): Attention-based GNN that learns to weight neighbors dynamically
Evaluation Metrics
- AUC (Area Under ROC Curve): Primary metric for ranking quality
- MRR (Mean Reciprocal Rank): Measures how highly the first correct path is ranked
- Parameter Count: Model complexity and inference cost
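Both ranking metrics have precise definitions; a minimal reference implementation in plain Python, assuming binary relevance labels:

```python
def auc(scores, labels):
    """AUC: probability that a random positive path outscores a random
    negative one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mrr(ranked_labels_per_query):
    """MRR: mean over queries of 1 / rank of the first correct path
    (0 when no correct path is retrieved)."""
    rr = []
    for labels in ranked_labels_per_query:
        rank = next((i + 1 for i, y in enumerate(labels) if y == 1), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```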
Key Findings
Main Results
- MLP Scorer: 93.9% AUC, 180K parameters
- GCN: 90.8% AUC, 2.4M parameters
- GAT: 87.4% AUC, 3.1M parameters
Simple MLP outperforms complex GNNs while using 13× fewer parameters
Why Do GNNs Fail in Dense Graphs?
Through ablation studies and error analysis, I identified two primary failure modes:
1. Hub-Noise Amplification
In dense graphs with high entity overlap, certain papers (like BERT, Attention Is All You Need, ImageNet) become hub nodes cited by thousands of papers. GNNs aggregate features from all neighbors, meaning:
- Hub node representations become diluted averages of thousands of papers
- Paths through hubs inherit this noisy, averaged representation
- The model struggles to distinguish between paths that happen to pass through popular hubs
In contrast, the MLP scorer treats each path independently without aggregating across the full neighborhood, avoiding hub-induced noise.
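A toy experiment makes the dilution concrete: mean-aggregating over many roughly independent neighbor vectors, as one GCN-style step does, drives a hub's representation toward the zero vector and washes out distinctive signal. Everything below is illustrative, not the paper's code:

```python
import random

random.seed(0)
DIM = 8

def rand_vec():
    """Stand-in for one neighbor's feature vector."""
    return [random.gauss(0, 1) for _ in range(DIM)]

def mean_aggregate(neighbor_vecs):
    """One mean-aggregation step over a node's neighborhood."""
    n = len(neighbor_vecs)
    return [sum(v[i] for v in neighbor_vecs) / n for i in range(DIM)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

# An ordinary paper with a handful of citations vs. a hub with thousands:
ordinary = mean_aggregate([rand_vec() for _ in range(5)])
hub = mean_aggregate([rand_vec() for _ in range(5000)])
# The hub's aggregated representation is far smaller in magnitude,
# i.e. far less distinctive.
```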
2. Over-Smoothing
GNNs with multiple layers suffer from over-smoothing: node representations become increasingly similar as information propagates through the graph. In dense graphs where most nodes are within 2-3 hops of each other, this effect is pronounced.
After 3 GNN layers, I observed that node embeddings had average cosine similarity of 0.87—nearly all papers looked the same to the model.
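The diagnostic used here, average pairwise cosine similarity across node embeddings, is straightforward to compute; a value near 1.0 means the embeddings have collapsed:

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def avg_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```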
When Lightweight Scorers Win
Based on these findings, I hypothesize that lightweight scorers (like MLPs) outperform GNNs when:
- High density: Entity overlap exceeds ~75%
- Hub-dominated: A small number of nodes have extremely high degree
- Short paths: Most paths are length 2-4 (little benefit from multi-hop aggregation)
- Strong node features: Individual node embeddings (e.g., paper abstracts encoded with BERT) are already highly informative
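Packaged as a rule of thumb: the 75% overlap threshold comes from this post, while the other cutoffs below are my own rough assumptions for illustration.

```python
def prefer_mlp_scorer(entity_overlap, hub_degree_share,
                      typical_path_len, strong_node_features):
    """Heuristic: True when a lightweight MLP scorer is likely to beat a GNN.

    hub_degree_share: fraction of all edges touching the top few hub nodes
    (the 0.1 cutoff is an illustrative assumption, not from the paper).
    """
    return (entity_overlap > 0.75          # high density
            and hub_degree_share > 0.1     # hub-dominated
            and typical_path_len <= 4      # short paths
            and strong_node_features)      # informative node embeddings
```

On the citation graph studied here (85.8% overlap, hub-dominated, paths of length 2-3, BERT node features), every condition holds.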
Methodology Details
Hybrid Retrieval Pipeline
The full KG-RAG system combines:
- Semantic Retrieval: BERT-based embedding similarity to find candidate papers
- Dual-Path Discovery:
  - Citation paths (Paper A → cites → Paper B)
  - Author paths (Paper A → authored_by → Author X → authored → Paper B)
- Path Scoring: The lightweight MLP scorer ranks discovered paths
- Answer Generation: GPT-4 synthesizes final answer using top-ranked paths as context
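Glue code for the four stages might look like the sketch below. Every callable here is a stand-in (the real system uses BERT retrieval, the MLP scorer, and GPT-4), so treat it as an illustration of the data flow rather than the actual implementation:

```python
def kg_rag_answer(query, kg, embed, score_path, generate, top_k=5):
    """Run the four-stage KG-RAG pipeline for one query."""
    # 1. Semantic retrieval: embedding similarity picks candidate papers.
    candidates = kg["nearest"](embed(query))
    # 2. Dual-path discovery: citation paths plus author paths per candidate.
    paths = [p for c in candidates
             for p in kg["citation_paths"](c) + kg["author_paths"](c)]
    # 3. Path scoring: rank with the lightweight scorer, keep the best.
    top_paths = sorted(paths, key=score_path, reverse=True)[:top_k]
    # 4. Answer generation: the LLM synthesizes from the top-ranked paths.
    return generate(query, top_paths)

# Toy stand-ins, just to show the shape of each interface:
kg = {
    "nearest": lambda q_emb: ["P1"],
    "citation_paths": lambda c: [("Q", "cites", c)],
    "author_paths": lambda c: [("Q", "authored_by", "X", "authored", c)],
}
answer = kg_rag_answer("query", kg, embed=len,
                       score_path=lambda p: -len(p),  # shorter = better
                       generate=lambda q, paths: paths)
```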
MLP Architecture
The winning MLP architecture is surprisingly simple:
- Input: Concatenated node embeddings + relation type embeddings + path metadata (length, recency)
- 3 hidden layers (512 → 256 → 128 neurons)
- ReLU activations with dropout (0.2)
- Output: Single score representing path relevance
Total parameters: 180K. Compare this to 2.4M for GCN and 3.1M for GAT.
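The total is easy to sanity-check. With the layer widths above, the count depends only on the input width, which the post does not state; assuming an input of roughly 30 features (an assumption) happens to land almost exactly on the reported 180K:

```python
def mlp_param_count(in_dim, hidden=(512, 256, 128), out_dim=1):
    """Weights plus biases for a feedforward scorer with the given widths."""
    dims = [in_dim, *hidden, out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# An input width of 30 (hypothetical) gives ~180K parameters:
total = mlp_param_count(30)  # 180225
```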
Implications
For Practitioners
- Start simple: Before reaching for GNNs, try lightweight scorers on your KG-RAG tasks
- Measure graph density: Entity overlap percentage is a good indicator of whether GNNs will help or hurt
- Node features matter: Invest in high-quality node embeddings (e.g., BERT for text-rich graphs)
For Researchers
- Density matters: Current GNN benchmarks focus on sparse graphs (social networks, molecular graphs). Dense KGs represent a different regime where GNN assumptions break down.
- Hub robustness: Future GNN architectures should explicitly handle hub-dominated graphs, perhaps through degree-aware aggregation or selective neighbor sampling.
- Task-specific architecture: Path ranking in dense graphs may fundamentally differ from node classification in sparse graphs—architectures should reflect this.
Limitations & Future Work
This research has several limitations:
- Single domain: Results are based on academic citation graphs. Generalization to other dense KG domains (e.g., biological networks, knowledge bases) requires validation.
- Static graphs: The citation graph is static. Dynamic KGs where structure evolves over time may behave differently.
- Path length: Analysis focused on paths of length 2-5. Very long paths (6+) were not extensively studied.
Future work directions include:
- Developing density-aware hybrid architectures that combine MLP and GNN approaches
- Exploring selective neighbor aggregation to mitigate hub-noise amplification
- Extending analysis to other dense KG domains beyond academic citations
- Investigating whether findings hold for very large graphs (10M+ nodes)
Code & Resources
The paper will be published with full reproducibility materials:
- Complete dataset (3,215 papers, 784 queries)
- Model implementations (PyTorch) for MLP, GCN, and GAT scorers
- Evaluation scripts and notebooks
- Pre-trained model checkpoints
Presentation at ICMLC 2026
I'll be presenting this work at ICMLC 2026 in February. If you're attending or interested in discussing the research, I'd love to connect.
Get in Touch