PESU Venture Labs · Jan 2022 – Jan 2024

OneSKU

A hybrid BM25 + embedding retrieval system solving the vendor catalog alignment problem at scale

PyTorch · YOLO · OpenCV · LlamaIndex · OpenAI LLMs · BM25 · BERT · Python

The Challenge

E-commerce platforms face a fundamental problem: the same product can have dozens of different names, descriptions, and attributes across different vendor catalogs. A "Samsung 65-inch 4K Smart TV" might be listed as "Samsung QN65Q60AAFXZA 65 QLED 4K" by one vendor and "Samsung 65 Inch Class Q60A QLED 4K Smart TV" by another.

For platforms aggregating products from multiple vendors, this heterogeneity makes reliable product matching, deduplication, and unified search across catalogs nearly impossible.

Traditional exact-match or fuzzy-string approaches fail because they can't understand semantic equivalence. At PESU Venture Labs, I was tasked with building a solution that could harmonize vendor catalogs containing millions of SKUs.

Technical Architecture

Hybrid Retrieval: Best of Both Worlds

Rather than choosing between lexical (BM25) and semantic (embeddings) approaches, I designed a hybrid system that leverages the strengths of both: BM25 handles exact lexical signals such as brand names and model numbers, while BERT embeddings capture semantic similarity between free-text descriptions.

The key innovation was the custom harmonization pipeline that intelligently combines these signals based on the nature of the query and available product attributes.
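The core idea of combining the two signals can be sketched as a weighted blend of a normalized BM25 score and an embedding cosine similarity. This is a minimal illustration, not the production pipeline; the function names and the min-max normalization scheme are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(bm25_score, max_bm25, query_vec, product_vec, alpha=0.6):
    """Weighted blend of a lexical and a semantic signal.

    The raw BM25 score is normalized against the best lexical score in the
    candidate set so both signals share a [0, 1] scale; alpha controls how
    much weight the lexical side gets (a tunable assumption here).
    """
    lexical = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    semantic = cosine(query_vec, product_vec)
    return alpha * lexical + (1 - alpha) * semantic
```

In practice the weighting would vary per query, as the harmonization pipeline described below does based on which attributes are present.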

Domain-Aware Feature Separation

One critical insight: not all product attributes should be treated equally. I separated features into two categories:

Categorical Features

For categorical features, exact matching (via BM25) is heavily weighted because "Samsung" and "Sony" are semantically similar in embedding space but represent fundamentally different products.
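A sketch of that hard-gating behavior for categorical attributes, assuming a simple dict representation of extracted attributes (the attribute names are illustrative):

```python
def categorical_score(query_attrs, product_attrs, hard_attrs=("brand",)):
    """Exact matching for categorical attributes.

    For attributes like brand, a mismatch is disqualifying: "Samsung" and
    "Sony" sit close together in embedding space but name different
    products, so the gate returns 0.0 rather than a softened similarity.
    """
    for attr in hard_attrs:
        q, p = query_attrs.get(attr), product_attrs.get(attr)
        if q and p and q.casefold() != p.casefold():
            return 0.0
    # Remaining categorical attributes score by fraction of exact matches.
    matched = sum(
        1 for k, v in query_attrs.items()
        if product_attrs.get(k, "").casefold() == v.casefold()
    )
    return matched / len(query_attrs) if query_attrs else 0.0
```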

Numerical Features

Numerical features required special handling. Rather than treating "65 inch" and "55 inch" as opaque text tokens, I implemented range-aware matching for continuous attributes like screen size, plus hierarchical matching for capabilities (a 4K panel can render 1080p content, but not vice versa).
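A minimal sketch of both ideas, assuming a regex-based size extractor, a linear closeness score, and a hand-written resolution hierarchy (all illustrative, not the production encodings):

```python
import re

def parse_inches(text):
    """Pull a screen size in inches out of free text, e.g. '65 Inch' -> 65.0."""
    m = re.search(r'(\d+(?:\.\d+)?)\s*(?:inches|inch|in\b|")', text, re.IGNORECASE)
    return float(m.group(1)) if m else None

# Capability hierarchy: a higher resolution can render everything below it.
RESOLUTION_RANK = {"720p": 1, "1080p": 2, "1440p": 3, "4k": 4, "8k": 5}

def size_score(query_size, product_size, tolerance=2.0):
    """Closeness of two sizes on a linear scale: 1.0 at equality, decaying
    to 0.0 once the gap exceeds the tolerance (in inches)."""
    gap = abs(query_size - product_size)
    return max(0.0, 1.0 - gap / tolerance)

def resolution_satisfies(required, offered):
    """True if the offered resolution contains the required one in
    capability (4K covers 1080p, not vice versa)."""
    return RESOLUTION_RANK[offered.lower()] >= RESOLUTION_RANK[required.lower()]
```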

Custom Harmonization Pipeline

The harmonization pipeline processes each product through multiple stages:

  1. Attribute Extraction: Parse product titles and descriptions to extract structured attributes
  2. Normalization: Standardize units, brand names, and common variations
  3. Feature Encoding:
    • BM25 index for categorical and exact-match features
    • BERT embeddings for semantic descriptions
    • Specialized numerical encodings for size/capacity features
  4. Hybrid Scoring: Weighted combination based on feature confidence
  5. Re-ranking: Final re-ranking using cross-encoder for top candidates
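The normalization stage (step 2) can be sketched as table-driven rewriting. The alias table and unit patterns below are hypothetical placeholders; a production system would maintain much larger, per-category versions.

```python
import re

# Hypothetical alias and unit tables for illustration only.
BRAND_ALIASES = {"samsung electronics": "samsung", "lg electronics": "lg"}
UNIT_PATTERNS = [
    # "65 Inch" / "65 in." / '65"'  ->  "65in"
    (re.compile(r'(\d+(?:\.\d+)?)\s*(?:inches|inch|in\.?|")', re.I), r'\g<1>in'),
    # "512 gigabytes" / "512 gb"  ->  "512GB"
    (re.compile(r'(\d+(?:\.\d+)?)\s*(?:gigabytes?|gb)', re.I), r'\g<1>GB'),
]

def normalize(title):
    """Standardize units and brand names so later stages compare like with like."""
    text = title.strip().lower()
    for pattern, repl in UNIT_PATTERNS:
        text = pattern.sub(repl, text)
    for alias, canonical in BRAND_ALIASES.items():
        text = text.replace(alias, canonical)
    return re.sub(r'\s+', ' ', text)
```

Normalizing before encoding means the BM25 index and the embeddings both see one canonical surface form per attribute, instead of every vendor's spelling of it.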

Performance Optimizations

Sub-15s Query Latency at Scale

Achieving sub-15 second query latency across multi-million SKU inventories required aggressive optimization:

1. Approximate Nearest Neighbor Search

For embedding-based retrieval, exact nearest neighbor search doesn't scale. I implemented HNSW (Hierarchical Navigable Small World) indexing, reducing search time from O(n) to O(log n) with minimal accuracy trade-off.
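The sketch below shows the retrieval interface with an exact brute-force search, which is the O(n) baseline the text refers to; in production an HNSW graph index (e.g. via the hnswlib library) would sit behind the same `query` interface. The class and method names are assumptions for illustration.

```python
import numpy as np

class BruteForceANN:
    """Exact cosine k-NN baseline with the same query shape an HNSW index
    would expose. Brute force is O(n) per query; swapping in an HNSW graph
    makes queries roughly logarithmic at a small recall cost."""

    def __init__(self, vectors):
        v = np.asarray(vectors, dtype=np.float32)
        # Pre-normalize so a dot product equals cosine similarity.
        self.vectors = v / np.linalg.norm(v, axis=1, keepdims=True)

    def query(self, q, k=5):
        q = np.asarray(q, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.vectors @ q
        top = np.argsort(-sims)[:k]
        return top.tolist(), sims[top].tolist()
```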

2. Inverted Index Optimization

BM25 relies on inverted indexes. I optimized how the index was constructed and queried so that lexical retrieval stayed fast even across multi-million SKU catalogs.

3. Early Termination

For many queries, the top results are obvious. I implemented confidence-based early termination—if the top candidate scores significantly higher than alternatives, skip full re-ranking.
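The termination check itself can be as simple as a margin test on the first-stage scores. The margin value here is a made-up placeholder; in practice it would be tuned against held-out matching data.

```python
def should_skip_rerank(scores, margin=0.2):
    """Confidence-based early termination: if the best first-stage score
    beats the runner-up by more than `margin`, the expensive cross-encoder
    pass is unlikely to change the winner, so skip it."""
    if len(scores) < 2:
        return True
    ranked = sorted(scores, reverse=True)
    return (ranked[0] - ranked[1]) > margin
```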

Handling Noisy Data

Vendor catalogs are messy: missing attributes, inconsistent units and naming, and keyword-stuffed descriptions were all common.

To handle this, I implemented a confidence scoring system that tracks data quality at the attribute level. Low-confidence attributes receive reduced weight in the final scoring function.
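A minimal sketch of attribute-level confidence weighting, assuming per-attribute match scores and confidence values in [0, 1] (the representation is an assumption, not the production scoring function):

```python
def weighted_match_score(attr_scores, attr_confidence):
    """Combine per-attribute match scores, down-weighting attributes whose
    extracted values look unreliable. A confidence of 0.0 removes the
    attribute from the final score entirely."""
    total = sum(attr_confidence.get(a, 1.0) for a in attr_scores)
    if total == 0:
        return 0.0
    weighted = sum(
        score * attr_confidence.get(attr, 1.0)
        for attr, score in attr_scores.items()
    )
    return weighted / total
```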

Real-World Impact

Key Results

  • Query Latency: Sub-15s across multi-million SKU inventories
  • Accuracy: 94% precision on vendor catalog matching benchmarks
  • Scale: Successfully deployed on catalogs with 5M+ products
  • Recall Improvement: 40% increase vs. pure BM25 baseline

The system was deployed in production for a major e-commerce client, where it powers product matching across 20+ vendor catalogs. The client reported a 60% reduction in manual catalog curation effort.

Technical Challenges & Solutions

Challenge 1: Embedding Space Collapse

Early experiments with BERT embeddings revealed a problem: products from the same category (e.g., all Samsung TVs) clustered too tightly in embedding space, making it hard to distinguish between different models.

Solution: Implemented contrastive learning with hard negative mining. The model learned to separate similar-but-different products (e.g., "Samsung 65 Q60A" vs "Samsung 65 Q70A") by explicitly training on these challenging cases.
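The mining step can be sketched as selecting, for each anchor product, the non-matching candidates that embed closest to it; those become the negatives in the contrastive loss. This is an illustrative selection routine only, with assumed data shapes, not the training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_hard_negatives(anchor_vec, candidates, k=2):
    """Hard negatives = the non-matching products closest to the anchor in
    embedding space, e.g. a 'Q70A' listing when the anchor is a 'Q60A'.
    candidates: iterable of (product_id, vector, is_match)."""
    scored = [
        (cosine(anchor_vec, vec), pid)
        for pid, vec, is_match in candidates
        if not is_match
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]
```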

Challenge 2: BM25 Keyword Stuffing

Some vendors include extensive keyword stuffing in product descriptions ("TV television smart TV 4K TV…"), which inflates BM25 scores artificially.

Solution: Term-frequency saturation via the BM25+ variant, which caps the score contribution of repeated terms, combined with term-importance weighting based on inverse document frequency across the entire catalog.
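The saturation effect comes straight from the BM25 family's scoring formula. The sketch below computes one term's BM25+ contribution; the k1/b/delta defaults are standard textbook values, not the tuned production parameters.

```python
import math

def bm25_plus_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75, delta=1.0):
    """One term's BM25+ contribution to a document's score.

    The k1 saturation means the 10th repetition of 'TV' adds far less than
    the first, blunting keyword stuffing; delta is BM25+'s lower-bound
    correction for long documents; the IDF factor down-weights terms that
    appear across the whole catalog.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1) / (length_norm + tf) + delta)
```

Because the term-frequency component saturates, a description repeating "TV" ten times scores only modestly higher than one mentioning it once, instead of ten times higher.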

Challenge 3: Cross-Category Leakage

Occasionally, products from different categories would match incorrectly (e.g., "Apple iPhone 13" matching "Apple MacBook 13-inch" due to shared brand and "13" appearing in both).

Solution: Category-aware filtering as a pre-processing step. Before detailed matching, enforce category consistency using a lightweight classifier. This reduced cross-category errors by 95%.
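The gate can be sketched as a category prediction on both titles followed by a consistency check. The keyword lookup below stands in for the lightweight classifier and is purely illustrative; the keyword table is invented for this example.

```python
# Hypothetical keyword table standing in for the trained lightweight classifier.
CATEGORY_KEYWORDS = {
    "phone": {"iphone", "smartphone", "galaxy s"},
    "laptop": {"macbook", "notebook", "thinkpad"},
}

def predict_category(title):
    """Cheap category guess from title keywords; 'unknown' when nothing hits."""
    t = title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in t for kw in keywords):
            return category
    return "unknown"

def category_consistent(query_title, candidate_title):
    """Pre-filter: drop candidates whose predicted category disagrees with
    the query's, so 'Apple iPhone 13' never scores against
    'Apple MacBook 13-inch'. Unknowns pass through rather than filtering."""
    q, c = predict_category(query_title), predict_category(candidate_title)
    return q == c or "unknown" in (q, c)
```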

Lessons Learned

  1. Hybrid approaches dominate: Pure neural models sound elegant, but combining them with classical IR techniques (BM25) consistently outperforms either alone.
  2. Domain knowledge is crucial: Understanding that brand names require exact matching while descriptions benefit from semantic similarity made the difference between 75% and 94% accuracy.
  3. Data quality matters more than model complexity: Spending time on robust attribute extraction and normalization improved results more than trying increasingly complex models.
  4. Performance is a feature: Sub-15s latency was a hard requirement. Users won't wait 60 seconds for catalog matching results.

Future Improvements

Several areas remain for future work, from richer numerical encodings to broader category coverage.

WANT TO DISCUSS IR & RETRIEVAL SYSTEMS?

I'm happy to dive deeper into hybrid retrieval architectures, BM25 optimization techniques, or embedding-based search systems.

Let's Talk