PESU Venture Labs · Jan 2022 – Jan 2024

OneSKU

A hybrid BM25 + embedding retrieval system solving the vendor catalog alignment problem at scale

PyTorch · YOLO · OpenCV · LlamaIndex · OpenAI LLMs · BM25 · BERT · Python

The Challenge

E-commerce platforms face a fundamental problem: the same product can have dozens of different names, descriptions, and attributes across different vendor catalogs. A "Samsung 65-inch 4K Smart TV" might be listed as "Samsung QN65Q60AAFXZA 65 QLED 4K" by one vendor and "Samsung 65 Inch Class Q60A QLED 4K Smart TV" by another.

For platforms aggregating products from multiple vendors, this heterogeneity makes reliable product matching, deduplication, and unified search across catalogs nearly impossible.

Traditional exact-match or fuzzy-string approaches fail because they can't understand semantic equivalence. At PESU Venture Labs, I was tasked with building a solution that could harmonize vendor catalogs containing millions of SKUs.

Technical Architecture

Hybrid Retrieval: Best of Both Worlds

Rather than choosing between lexical (BM25) and semantic (embeddings) approaches, I designed a hybrid system that leverages the strengths of both: BM25 handles exact lexical signals such as brand names and model numbers, while BERT embeddings capture semantic similarity between free-text descriptions.

The key innovation was the custom harmonization pipeline that intelligently combines these signals based on the nature of the query and available product attributes.
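The core idea of combining the two signals can be sketched as a weighted blend of a normalized BM25 score and an embedding cosine similarity. This is a minimal illustration, not the production pipeline; the function names and the min-max normalization scheme are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(bm25_score, max_bm25, query_vec, product_vec, alpha=0.6):
    """Weighted blend of a lexical and a semantic signal.

    The raw BM25 score is normalized against the best lexical score in the
    candidate set so both signals share a [0, 1] scale; alpha controls how
    much weight the lexical side gets (a tunable assumption here).
    """
    lexical = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    semantic = cosine(query_vec, product_vec)
    return alpha * lexical + (1 - alpha) * semantic
```

In practice the weighting would vary per query, as the harmonization pipeline described below does based on which attributes are present.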

Domain-Aware Feature Separation

One critical insight: not all product attributes should be treated equally. I separated features into two categories:

Categorical Features

For categorical features, exact matching (via BM25) is heavily weighted because "Samsung" and "Sony" are semantically similar in embedding space but represent fundamentally different products.
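A sketch of that hard-gating behavior for categorical attributes, assuming a simple dict representation of extracted attributes (the attribute names are illustrative):

```python
def categorical_score(query_attrs, product_attrs, hard_attrs=("brand",)):
    """Exact matching for categorical attributes.

    For attributes like brand, a mismatch is disqualifying: "Samsung" and
    "Sony" sit close together in embedding space but name different
    products, so the gate returns 0.0 rather than a softened similarity.
    """
    for attr in hard_attrs:
        q, p = query_attrs.get(attr), product_attrs.get(attr)
        if q and p and q.casefold() != p.casefold():
            return 0.0
    # Remaining categorical attributes score by fraction of exact matches.
    matched = sum(
        1 for k, v in query_attrs.items()
        if product_attrs.get(k, "").casefold() == v.casefold()
    )
    return matched / len(query_attrs) if query_attrs else 0.0
```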

Numerical Features

Numerical features required special handling. Rather than treating "65 inch" and "55 inch" as opaque text tokens, I implemented range-aware matching for continuous attributes like screen size, plus hierarchical matching for capabilities (a 4K panel can render 1080p content, but not vice versa).
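A minimal sketch of both ideas, assuming a regex-based size extractor, a linear closeness score, and a hand-written resolution hierarchy (all illustrative, not the production encodings):

```python
import re

def parse_inches(text):
    """Pull a screen size in inches out of free text, e.g. '65 Inch' -> 65.0."""
    m = re.search(r'(\d+(?:\.\d+)?)\s*(?:inches|inch|in\b|")', text, re.IGNORECASE)
    return float(m.group(1)) if m else None

# Capability hierarchy: a higher resolution can render everything below it.
RESOLUTION_RANK = {"720p": 1, "1080p": 2, "1440p": 3, "4k": 4, "8k": 5}

def size_score(query_size, product_size, tolerance=2.0):
    """Closeness of two sizes on a linear scale: 1.0 at equality, decaying
    to 0.0 once the gap exceeds the tolerance (in inches)."""
    gap = abs(query_size - product_size)
    return max(0.0, 1.0 - gap / tolerance)

def resolution_satisfies(required, offered):
    """True if the offered resolution contains the required one in
    capability (4K covers 1080p, not vice versa)."""
    return RESOLUTION_RANK[offered.lower()] >= RESOLUTION_RANK[required.lower()]
```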

Custom Harmonization Pipeline

The harmonization pipeline processes each product through multiple stages:

  1. Attribute Extraction: Parse product titles and descriptions to extract structured attributes
  2. Normalization: Standardize units, brand names, and common variations
  3. Feature Encoding:
    • BM25 index for categorical and exact-match features
    • BERT embeddings for semantic descriptions
    • Specialized numerical encodings for size/capacity features
  4. Hybrid Scoring: Weighted combination based on feature confidence
  5. Re-ranking: Final re-ranking using cross-encoder for top candidates
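The normalization stage (step 2) can be sketched as table-driven rewriting. The alias table and unit patterns below are hypothetical placeholders; a production system would maintain much larger, per-category versions.

```python
import re

# Hypothetical alias and unit tables for illustration only.
BRAND_ALIASES = {"samsung electronics": "samsung", "lg electronics": "lg"}
UNIT_PATTERNS = [
    # "65 Inch" / "65 in." / '65"'  ->  "65in"
    (re.compile(r'(\d+(?:\.\d+)?)\s*(?:inches|inch|in\.?|")', re.I), r'\g<1>in'),
    # "512 gigabytes" / "512 gb"  ->  "512GB"
    (re.compile(r'(\d+(?:\.\d+)?)\s*(?:gigabytes?|gb)', re.I), r'\g<1>GB'),
]

def normalize(title):
    """Standardize units and brand names so later stages compare like with like."""
    text = title.strip().lower()
    for pattern, repl in UNIT_PATTERNS:
        text = pattern.sub(repl, text)
    for alias, canonical in BRAND_ALIASES.items():
        text = text.replace(alias, canonical)
    return re.sub(r'\s+', ' ', text)
```

Normalizing before encoding means the BM25 index and the embeddings both see one canonical surface form per attribute, instead of every vendor's spelling of it.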

Performance Optimizations

Sub-15s Query Latency at Scale

Achieving sub-15 second query latency across multi-million SKU inventories required aggressive optimization:

1. Approximate Nearest Neighbor Search

For embedding-based retrieval, exact nearest neighbor search doesn't scale. I implemented HNSW (Hierarchical Navigable Small World) indexing, reducing search time from O(n) to O(log n) with minimal accuracy trade-off.
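The sketch below shows the retrieval interface with an exact brute-force search, which is the O(n) baseline the text refers to; in production an HNSW graph index (e.g. via the hnswlib library) would sit behind the same `query` interface. The class and method names are assumptions for illustration.

```python
import numpy as np

class BruteForceANN:
    """Exact cosine k-NN baseline with the same query shape an HNSW index
    would expose. Brute force is O(n) per query; swapping in an HNSW graph
    makes queries roughly logarithmic at a small recall cost."""

    def __init__(self, vectors):
        v = np.asarray(vectors, dtype=np.float32)
        # Pre-normalize so a dot product equals cosine similarity.
        self.vectors = v / np.linalg.norm(v, axis=1, keepdims=True)

    def query(self, q, k=5):
        q = np.asarray(q, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.vectors @ q
        top = np.argsort(-sims)[:k]
        return top.tolist(), sims[top].tolist()
```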

2. Inverted Index Optimization

BM25 relies on inverted indexes. I optimized how the index was constructed and queried so that lexical retrieval stayed fast even across multi-million SKU catalogs.

3. Early Termination

For many queries, the top results are obvious. I implemented confidence-based early termination—if the top candidate scores significantly higher than alternatives, skip full re-ranking.
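The termination check itself can be as simple as a margin test on the first-stage scores. The margin value here is a made-up placeholder; in practice it would be tuned against held-out matching data.

```python
def should_skip_rerank(scores, margin=0.2):
    """Confidence-based early termination: if the best first-stage score
    beats the runner-up by more than `margin`, the expensive cross-encoder
    pass is unlikely to change the winner, so skip it."""
    if len(scores) < 2:
        return True
    ranked = sorted(scores, reverse=True)
    return (ranked[0] - ranked[1]) > margin
```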

Handling Noisy Data

Vendor catalogs are messy: missing attributes, inconsistent units and naming, and keyword-stuffed descriptions were all common.

To handle this, I implemented a confidence scoring system that tracks data quality at the attribute level. Low-confidence attributes receive reduced weight in the final scoring function.
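A minimal sketch of attribute-level confidence weighting, assuming per-attribute match scores and confidence values in [0, 1] (the representation is an assumption, not the production scoring function):

```python
def weighted_match_score(attr_scores, attr_confidence):
    """Combine per-attribute match scores, down-weighting attributes whose
    extracted values look unreliable. A confidence of 0.0 removes the
    attribute from the final score entirely."""
    total = sum(attr_confidence.get(a, 1.0) for a in attr_scores)
    if total == 0:
        return 0.0
    weighted = sum(
        score * attr_confidence.get(attr, 1.0)
        for attr, score in attr_scores.items()
    )
    return weighted / total
```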

Real-World Impact

Key Results

  • Query Latency: Sub-15s across multi-million SKU inventories
  • Accuracy: 94% precision on vendor catalog matching benchmarks
  • Scale: Successfully deployed on catalogs with 5M+ products
  • Recall Improvement: 40% increase vs. pure BM25 baseline

The system was deployed in production for a major e-commerce client, where it powers product matching across 20+ vendor catalogs. The client reported a 60% reduction in manual catalog curation effort.

Technical Challenges & Solutions

Challenge 1: Embedding Space Collapse

Early experiments with BERT embeddings revealed a problem: products from the same category (e.g., all Samsung TVs) clustered too tightly in embedding space, making it hard to distinguish between different models.

Solution: Implemented contrastive learning with hard negative mining. The model learned to separate similar-but-different products (e.g., "Samsung 65 Q60A" vs "Samsung 65 Q70A") by explicitly training on these challenging cases.
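The mining step can be sketched as selecting, for each anchor product, the non-matching candidates that embed closest to it; those become the negatives in the contrastive loss. This is an illustrative selection routine only, with assumed data shapes, not the training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_hard_negatives(anchor_vec, candidates, k=2):
    """Hard negatives = the non-matching products closest to the anchor in
    embedding space, e.g. a 'Q70A' listing when the anchor is a 'Q60A'.
    candidates: iterable of (product_id, vector, is_match)."""
    scored = [
        (cosine(anchor_vec, vec), pid)
        for pid, vec, is_match in candidates
        if not is_match
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]
```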

Challenge 2: BM25 Keyword Stuffing

Some vendors include extensive keyword stuffing in product descriptions ("TV television smart TV 4K TV…"), which inflates BM25 scores artificially.

Solution: Term-frequency saturation via the BM25+ variant, which caps the score contribution of repeated terms, combined with term-importance weighting based on inverse document frequency across the entire catalog.
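The saturation effect comes straight from the BM25 family's scoring formula. The sketch below computes one term's BM25+ contribution; the k1/b/delta defaults are standard textbook values, not the tuned production parameters.

```python
import math

def bm25_plus_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75, delta=1.0):
    """One term's BM25+ contribution to a document's score.

    The k1 saturation means the 10th repetition of 'TV' adds far less than
    the first, blunting keyword stuffing; delta is BM25+'s lower-bound
    correction for long documents; the IDF factor down-weights terms that
    appear across the whole catalog.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1) / (length_norm + tf) + delta)
```

Because the term-frequency component saturates, a description repeating "TV" ten times scores only modestly higher than one mentioning it once, instead of ten times higher.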

Challenge 3: Cross-Category Leakage

Occasionally, products from different categories would match incorrectly (e.g., "Apple iPhone 13" matching "Apple MacBook 13-inch" due to shared brand and "13" appearing in both).

Solution: Category-aware filtering as a pre-processing step. Before detailed matching, enforce category consistency using a lightweight classifier. This reduced cross-category errors by 95%.
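The gate can be sketched as a category prediction on both titles followed by a consistency check. The keyword lookup below stands in for the lightweight classifier and is purely illustrative; the keyword table is invented for this example.

```python
# Hypothetical keyword table standing in for the trained lightweight classifier.
CATEGORY_KEYWORDS = {
    "phone": {"iphone", "smartphone", "galaxy s"},
    "laptop": {"macbook", "notebook", "thinkpad"},
}

def predict_category(title):
    """Cheap category guess from title keywords; 'unknown' when nothing hits."""
    t = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in t for kw in keywords):
            return category
    return "unknown"

def category_consistent(query_title, candidate_title):
    """Pre-filter: drop candidates whose predicted category disagrees with
    the query's, so 'Apple iPhone 13' never scores against
    'Apple MacBook 13-inch'. Unknowns pass through rather than filtering."""
    q, c = predict_category(query_title), predict_category(candidate_title)
    return q == c or "unknown" in (q, c)
```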

Lessons Learned

  1. Hybrid approaches dominate: Pure neural models sound elegant, but combining them with classical IR techniques (BM25) consistently outperforms either alone.
  2. Domain knowledge is crucial: Understanding that brand names require exact matching while descriptions benefit from semantic similarity made the difference between 75% and 94% accuracy.
  3. Data quality matters more than model complexity: Spending time on robust attribute extraction and normalization improved results more than trying increasingly complex models.
  4. Performance is a feature: Sub-15s latency was a hard requirement. Users won't wait 60 seconds for catalog matching results.

Future Improvements

Several areas remain for future work, from richer numerical encodings to broader category coverage.

WANT TO DISCUSS IR & RETRIEVAL SYSTEMS?

I'm happy to dive deeper into hybrid retrieval architectures, BM25 optimization techniques, or embedding-based search systems.

Let's Talk