Abstract
Knowledge Graph-augmented Retrieval-Augmented Generation (KG-RAG) systems have emerged as a powerful approach for question answering over structured knowledge. The conventional wisdom suggests that Graph Neural Networks (GNNs) should excel at ranking paths in knowledge graphs due to their ability to aggregate neighborhood information.
However, our research demonstrates a counterintuitive finding: in dense academic citation graphs with high entity overlap, simple MLP-based scorers outperform complex GNN architectures while using 13× fewer parameters.
This paper was accepted to ICMLC 2026 and will be presented in February 2026.
The Research Question
Knowledge graph RAG systems typically follow a two-stage process:
- Path Discovery: Find potential paths in the KG connecting query entities to candidate answers
- Path Ranking: Score and rank these paths by relevance
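The two stages above can be sketched in a few lines. The BFS-based discovery and the stand-in scorer below are illustrative only, not the paper's implementation:

```python
from collections import deque

def discover_paths(graph, source, target, max_len=3):
    """Stage 1: enumerate simple paths from source to target (<= max_len edges)."""
    paths, queue = [], deque([[source]])
    while queue:
        path = queue.popleft()
        if path[-1] == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no revisits)
                queue.append(path + [nxt])
    return paths

def rank_paths(paths, score_path):
    """Stage 2: score each discovered path and sort best-first."""
    return sorted(paths, key=score_path, reverse=True)

# Toy citation graph; a stand-in scorer that prefers shorter paths.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
paths = discover_paths(graph, "A", "D")
ranked = rank_paths(paths, score_path=lambda p: 1.0 / len(p))
```

The question the paper asks is what should play the role of `score_path` in stage 2.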
For path ranking, GNNs such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) are the popular choice: they are designed to exploit graph structure by aggregating information from neighboring nodes.
But what if the graph is too dense? What if nearly every entity connects to nearly every other entity through short paths? Does graph structure still help, or does it become noise?
Experimental Setup
Dataset: Dense Academic Citation Graph
I constructed a knowledge graph from academic papers in machine learning and NLP:
- 3,215 papers spanning 2010-2023
- 85.8% entity overlap (most papers cite many of the same foundational works)
- 784 test queries covering various types of paper relationships
The key characteristic: this graph is dense. Most papers are connected through citation paths of length 2 or 3. This density is representative of real-world academic KGs where foundational papers create hub nodes.
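The post does not spell out how density was measured; as one plausible proxy (an assumption on my part, not the paper's metric), you can compute the fraction of node pairs connected within k hops:

```python
from itertools import combinations

def frac_pairs_within(graph, k):
    """Fraction of node pairs joined by a path of at most k edges,
    treating citation edges as undirected. A rough density proxy."""
    adj = {}
    for u, nbrs in graph.items():
        for v in nbrs:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    def within(u, v):
        seen = {u}
        frontier = {u}
        for _ in range(k):
            frontier = {w for x in frontier for w in adj[x]} - seen
            seen |= frontier
        return v in seen
    pairs = list(combinations(sorted(adj), 2))
    return sum(within(u, v) for u, v in pairs) / len(pairs)

# A hub-dominated toy graph: every pair is reachable within 2 hops.
star = {"hub": ["a", "b", "c", "d"]}
```

On a hub-dominated graph like `star`, this proxy saturates at 1.0 already at k=2, mirroring the 2-3 hop connectivity described above.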
Models Compared
I compared three path ranking approaches:
- MLP Scorer: Simple feedforward network operating on path features (length, relation types, node features)
- GCN (Graph Convolutional Network): Multi-layer GCN aggregating neighborhood information
- GAT (Graph Attention Network): Attention-based GNN that learns to weight neighbors dynamically
Evaluation Metrics
- AUC (Area Under ROC Curve): Primary metric for ranking quality
- MRR (Mean Reciprocal Rank): Measures how highly the first correct path is ranked
- Parameter Count: Model complexity and inference cost
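Both ranking metrics have precise definitions; a minimal reference implementation in plain Python, assuming binary relevance labels:

```python
def auc(scores, labels):
    """AUC: probability that a random positive path outscores a random
    negative one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mrr(ranked_labels_per_query):
    """MRR: mean over queries of 1 / rank of the first correct path
    (0 when no correct path is retrieved)."""
    rr = []
    for labels in ranked_labels_per_query:
        rank = next((i + 1 for i, y in enumerate(labels) if y == 1), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```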
Key Findings
Main Results
- MLP Scorer: 93.9% AUC, 180K parameters
- GCN: 90.8% AUC, 2.4M parameters
- GAT: 87.4% AUC, 3.1M parameters
Simple MLP outperforms complex GNNs while using 13× fewer parameters
Why Do GNNs Fail in Dense Graphs?
Through ablation studies and error analysis, I identified two primary failure modes:
1. Hub-Noise Amplification
In dense graphs with high entity overlap, certain papers (like BERT, Attention Is All You Need, ImageNet) become hub nodes cited by thousands of papers. GNNs aggregate features from all neighbors, meaning:
- Hub node representations become diluted averages of thousands of papers
- Paths through hubs inherit this noisy, averaged representation
- The model struggles to distinguish between paths that happen to pass through popular hubs
In contrast, the MLP scorer treats each path independently without aggregating across the full neighborhood, avoiding hub-induced noise.
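A toy experiment makes the dilution concrete: mean-aggregating over many roughly independent neighbor vectors, as one GCN-style step does, drives a hub's representation toward the zero vector and washes out distinctive signal. Everything below is illustrative, not the paper's code:

```python
import random

random.seed(0)
DIM = 8

def rand_vec():
    """Stand-in for one neighbor's feature vector."""
    return [random.gauss(0, 1) for _ in range(DIM)]

def mean_aggregate(neighbor_vecs):
    """One mean-aggregation step over a node's neighborhood."""
    n = len(neighbor_vecs)
    return [sum(v[i] for v in neighbor_vecs) / n for i in range(DIM)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

# An ordinary paper with a handful of citations vs. a hub with thousands:
ordinary = mean_aggregate([rand_vec() for _ in range(5)])
hub = mean_aggregate([rand_vec() for _ in range(5000)])
# The hub's aggregated representation is far smaller in magnitude,
# i.e. far less distinctive.
```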
2. Over-Smoothing
GNNs with multiple layers suffer from over-smoothing: node representations become increasingly similar as information propagates through the graph. In dense graphs where most nodes are within 2-3 hops of each other, this effect is pronounced.
After 3 GNN layers, I observed that node embeddings had average cosine similarity of 0.87—nearly all papers looked the same to the model.
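The diagnostic used here, average pairwise cosine similarity across node embeddings, is straightforward to compute; a value near 1.0 means the embeddings have collapsed:

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def avg_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```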
When Lightweight Scorers Win
Based on these findings, I hypothesize that lightweight scorers (like MLPs) outperform GNNs when:
- High density: Entity overlap exceeds ~75%
- Hub-dominated: A small number of nodes have extremely high degree
- Short paths: Most paths are length 2-4 (little benefit from multi-hop aggregation)
- Strong node features: Individual node embeddings (e.g., paper abstracts encoded with BERT) are already highly informative
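Packaged as a rule of thumb: the 75% overlap threshold comes from this post, while the other cutoffs below are my own rough assumptions for illustration.

```python
def prefer_mlp_scorer(entity_overlap, hub_degree_share,
                      typical_path_len, strong_node_features):
    """Heuristic: True when a lightweight MLP scorer is likely to beat a GNN.

    hub_degree_share: fraction of all edges touching the top few hub nodes
    (the 0.1 cutoff is an illustrative assumption, not from the paper).
    """
    return (entity_overlap > 0.75          # high density
            and hub_degree_share > 0.1     # hub-dominated
            and typical_path_len <= 4      # short paths
            and strong_node_features)      # informative node embeddings
```

On the citation graph studied here (85.8% overlap, hub-dominated, paths of length 2-3, BERT node features), every condition holds.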
Methodology Details
Hybrid Retrieval Pipeline
The full KG-RAG system combines:
- Semantic Retrieval: BERT-based embedding similarity to find candidate papers
- Dual-Path Discovery:
  - Citation paths (Paper A → cites → Paper B)
  - Author paths (Paper A → authored_by → Author X → authored → Paper B)
- Path Scoring: The lightweight MLP scorer ranks discovered paths
- Answer Generation: GPT-4 synthesizes final answer using top-ranked paths as context
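Glue code for the four stages might look like the sketch below. Every callable here is a stand-in (the real system uses BERT retrieval, the MLP scorer, and GPT-4), so treat it as an illustration of the data flow rather than the actual implementation:

```python
def kg_rag_answer(query, kg, embed, score_path, generate, top_k=5):
    """Run the four-stage KG-RAG pipeline for one query."""
    # 1. Semantic retrieval: embedding similarity picks candidate papers.
    candidates = kg["nearest"](embed(query))
    # 2. Dual-path discovery: citation paths plus author paths per candidate.
    paths = [p for c in candidates
             for p in kg["citation_paths"](c) + kg["author_paths"](c)]
    # 3. Path scoring: rank with the lightweight scorer, keep the best.
    top_paths = sorted(paths, key=score_path, reverse=True)[:top_k]
    # 4. Answer generation: the LLM synthesizes from the top-ranked paths.
    return generate(query, top_paths)

# Toy stand-ins, just to show the shape of each interface:
kg = {
    "nearest": lambda q_emb: ["P1"],
    "citation_paths": lambda c: [("Q", "cites", c)],
    "author_paths": lambda c: [("Q", "authored_by", "X", "authored", c)],
}
answer = kg_rag_answer("query", kg, embed=len,
                       score_path=lambda p: -len(p),  # shorter = better
                       generate=lambda q, paths: paths)
```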
MLP Architecture
The winning MLP architecture is surprisingly simple:
- Input: Concatenated node embeddings + relation type embeddings + path metadata (length, recency)
- 3 hidden layers (512 → 256 → 128 neurons)
- ReLU activations with dropout (0.2)
- Output: Single score representing path relevance
Total parameters: 180K. Compare this to 2.4M for GCN and 3.1M for GAT.
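The total is easy to sanity-check. With the layer widths above, the count depends only on the input width, which the post does not state; assuming an input of roughly 30 features (an assumption) happens to land almost exactly on the reported 180K:

```python
def mlp_param_count(in_dim, hidden=(512, 256, 128), out_dim=1):
    """Weights plus biases for a feedforward scorer with the given widths."""
    dims = [in_dim, *hidden, out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# An input width of 30 (hypothetical) gives ~180K parameters:
total = mlp_param_count(30)  # 180225
```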
Implications
For Practitioners
- Start simple: Before reaching for GNNs, try lightweight scorers on your KG-RAG tasks
- Measure graph density: Entity overlap percentage is a good indicator of whether GNNs will help or hurt
- Node features matter: Invest in high-quality node embeddings (e.g., BERT for text-rich graphs)
For Researchers
- Density matters: Current GNN benchmarks focus on sparse graphs (social networks, molecular graphs). Dense KGs represent a different regime where GNN assumptions break down.
- Hub robustness: Future GNN architectures should explicitly handle hub-dominated graphs, perhaps through degree-aware aggregation or selective neighbor sampling.
- Task-specific architecture: Path ranking in dense graphs may fundamentally differ from node classification in sparse graphs—architectures should reflect this.
Limitations & Future Work
This research has several limitations:
- Single domain: Results are based on academic citation graphs. Generalization to other dense KG domains (e.g., biological networks, knowledge bases) requires validation.
- Static graphs: The citation graph is static. Dynamic KGs where structure evolves over time may behave differently.
- Path length: Analysis focused on paths of length 2-5. Very long paths (6+) were not extensively studied.
Future work directions include:
- Developing density-aware hybrid architectures that combine MLP and GNN approaches
- Exploring selective neighbor aggregation to mitigate hub-noise amplification
- Extending analysis to other dense KG domains beyond academic citations
- Investigating whether findings hold for very large graphs (10M+ nodes)
Code & Resources
The paper will be published with full reproducibility materials:
- Complete dataset (3,215 papers, 784 queries)
- Model implementations (PyTorch) for MLP, GCN, and GAT scorers
- Evaluation scripts and notebooks
- Pre-trained model checkpoints
Presentation at ICMLC 2026
I'll be presenting this work at ICMLC 2026 in February. If you're attending or interested in discussing the research, I'd love to connect.
Get in Touch