PESU Venture Labs · Jan 2022 – Jan 2024
A hybrid BM25 + embedding retrieval system solving the vendor catalog alignment problem at scale
E-commerce platforms face a fundamental problem: the same product can have dozens of different names, descriptions, and attributes across different vendor catalogs. A "Samsung 65-inch 4K Smart TV" might be listed as "Samsung QN65Q60AAFXZA 65 QLED 4K" by one vendor and "Samsung 65 Inch Class Q60A QLED 4K Smart TV" by another.
For platforms aggregating products from multiple vendors, this heterogeneity makes it nearly impossible to:
Traditional exact-match or fuzzy-string approaches fail because they can't understand semantic equivalence. At PESU Venture Labs, I was tasked with building a solution that could harmonize vendor catalogs containing millions of SKUs.
Rather than choosing between lexical (BM25) and semantic (embeddings) approaches, I designed a hybrid system that leverages the strengths of both:
The key innovation was the custom harmonization pipeline that intelligently combines these signals based on the nature of the query and available product attributes.
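The source doesn't specify the exact fusion formula, but a common way to combine the two signals is min-max normalization followed by a weighted sum. A minimal sketch (the `alpha` weight and function names are illustrative, not the production values):

```python
def minmax(scores):
    """Normalize a score list to [0, 1] so BM25 and embedding
    similarities live on a comparable scale."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, embed, alpha=0.5):
    """Weighted fusion of lexical (BM25) and semantic (embedding)
    scores for the same ordered candidate list."""
    b, e = minmax(bm25), minmax(embed)
    return [alpha * x + (1 - alpha) * y for x, y in zip(b, e)]
```

In practice `alpha` would be tuned per query type — exactly the kind of attribute-dependent weighting the pipeline described here performs.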
One critical insight: not all product attributes should be treated equally. I separated features into two categories:
For categorical features, exact matching (via BM25) is heavily weighted because "Samsung" and "Sony" sit close together in embedding space yet represent fundamentally different products.
Numerical features required special handling. Rather than treating "65 inch" and "55 inch" as text tokens, I implemented range-aware matching for continuous attributes, plus hierarchy-aware comparisons for ordered capabilities (a 4K panel can display 1080p content, but not vice versa).
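To make the two treatments concrete, here is a minimal sketch — the tolerance, the resolution ranking, and all names are hypothetical illustrations, not the production logic:

```python
def numeric_similarity(a, b, tolerance=0.1):
    """Similarity for numeric attributes (e.g. screen size in inches):
    1.0 for an exact match, decaying linearly to 0 at `tolerance`
    relative error, so 64" vs 65" scores high but 55" vs 65" scores 0."""
    if a == b:
        return 1.0
    rel = abs(a - b) / max(abs(a), abs(b))
    return max(0.0, 1.0 - rel / tolerance)

# Ordered capability tiers: a higher tier covers lower ones,
# but not the reverse (illustrative ranking).
RESOLUTION_RANK = {"720p": 0, "1080p": 1, "4K": 2, "8K": 3}

def resolution_compatible(required, offered):
    """True if the offered resolution covers the required one."""
    return RESOLUTION_RANK[offered] >= RESOLUTION_RANK[required]
```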
The harmonization pipeline processes each product through multiple stages:
Achieving sub-15 second query latency across multi-million SKU inventories required aggressive optimization:
For embedding-based retrieval, exact nearest neighbor search doesn't scale. I implemented HNSW (Hierarchical Navigable Small World) indexing, reducing search time from O(n) to O(log n) with minimal accuracy trade-off.
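Production deployments typically use a library (e.g. hnswlib or FAISS) for this, but the core routine HNSW runs at each layer is a greedy best-first search over a proximity graph. A simplified single-layer sketch — not full HNSW, and the toy graph in any usage would be hand-built rather than learned:

```python
import heapq
import math

def greedy_search(graph, vectors, query, entry, ef=8):
    """Beam search over a neighbor graph: expand the closest unvisited
    node, keep the `ef` best results seen, stop when the nearest
    remaining candidate is worse than the worst kept result."""
    d0 = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]        # min-heap by distance
    results = [(-d0, entry)]          # max-heap (negated) of kept results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                     # no candidate can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(vectors[nb], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-d, n) for d, n in results)
```

Real HNSW adds the layered hierarchy (coarse layers route the search, the bottom layer refines it), which is where the logarithmic scaling comes from.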
BM25 relies on inverted indexes. I optimized index construction by:
For many queries, the top results are obvious. I implemented confidence-based early termination—if the top candidate scores significantly higher than alternatives, skip full re-ranking.
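The margin test can be as simple as comparing the leader against the runner-up; the threshold below is a hypothetical placeholder for whatever value validation would choose:

```python
def maybe_skip_rerank(candidates, margin=0.2):
    """candidates: list of (doc_id, first-stage score), sorted descending.
    If the leader beats the runner-up by `margin` relative score,
    return it directly and skip the expensive re-ranker."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][0]
    (top_id, top), (_, second) = candidates[0], candidates[1]
    if top > 0 and (top - second) / top >= margin:
        return top_id    # confident: short-circuit
    return None          # ambiguous: fall through to full re-ranking
```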
Vendor catalogs are messy. Common issues included:
To handle this, I implemented a confidence scoring system that tracks data quality at the attribute level. Low-confidence attributes receive reduced weight in the final scoring function.
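Attribute-level down-weighting reduces to a confidence-weighted average of per-attribute match scores. A minimal sketch (names and weights are illustrative):

```python
def weighted_match_score(attr_scores, confidences):
    """Combine per-attribute match scores, down-weighting attributes
    whose source data is low-confidence. Both dicts are keyed by
    attribute name; missing confidences default to full weight."""
    num = sum(attr_scores[a] * confidences.get(a, 1.0) for a in attr_scores)
    den = sum(confidences.get(a, 1.0) for a in attr_scores)
    return num / den if den else 0.0
```

A noisy "size" field with confidence 0.5 therefore moves the final score half as much as a clean "brand" field with confidence 1.0.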
The system was deployed in production for a major e-commerce client, where it powers product matching across 20+ vendor catalogs. The client reported a 60% reduction in manual catalog curation effort.
Early experiments with BERT embeddings revealed a problem: products from the same category (e.g., all Samsung TVs) clustered too tightly in embedding space, making it hard to distinguish between different models.
Solution: Implemented contrastive learning with hard negative mining. The model learned to separate similar-but-different products (e.g., "Samsung 65 Q60A" vs "Samsung 65 Q70A") by explicitly training on these challenging cases.
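The two ingredients — picking the most confusable negative and a margin-based contrastive objective — can be sketched in a few lines. This is the loss computation only, under the assumption of a triplet-style setup; the actual training loop and model are not shown:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hardest_negative(anchor, negatives):
    """Hard negative mining: the negative closest to the anchor,
    i.e. the most confusable wrong product (e.g. Q70A for a Q60A)."""
    return max(negatives, key=lambda v: cosine(anchor, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Contrastive triplet objective: zero only once the positive is at
    least `margin` more similar to the anchor than the negative."""
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)
```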
Some vendors include extensive keyword stuffing in product descriptions ("TV television smart TV 4K TV…"), which inflates BM25 scores artificially.
Solution: Term-frequency saturation via the BM25+ variant, which caps the score contribution of repeated terms so stuffing yields sharply diminishing returns. Combined with term importance weighting based on inverse document frequency across the entire catalog.
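The saturation behavior comes from BM25's term-frequency component, which is bounded above no matter how often a term repeats; BM25+ (Lv & Zhai) adds a lower-bound bonus `delta` on top. A sketch of those two pieces (parameter values are the conventional defaults, not necessarily the ones used here):

```python
import math

def bm25_plus_tf(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=1.0):
    """Saturating term-frequency component of BM25+. The fraction is
    bounded by (k1 + 1), so going from 10 to 100 repetitions of a term
    gains far less than going from 1 to 10."""
    norm = tf / (1 - b + b * doc_len / avg_len)   # length-normalized tf
    return (k1 + 1) * norm / (k1 + norm) + delta

def idf(n_docs, doc_freq):
    """Inverse document frequency: terms rare across the catalog
    contribute more than boilerplate words like 'TV'."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
```

A stuffed description that repeats "TV" fifty times is thus doubly blunted: the repetitions saturate, and "TV" carries low IDF across a catalog full of TVs.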
Occasionally, products from different categories would match incorrectly (e.g., "Apple iPhone 13" matching "Apple MacBook 13-inch" due to shared brand and "13" appearing in both).
Solution: Category-aware filtering as a pre-processing step. Before detailed matching, enforce category consistency using a lightweight classifier. This reduced cross-category errors by 95%.
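The gate itself can be very cheap. Below is a deliberately toy keyword-vote version to show the shape of the pre-filter — the production system would use a trained lightweight classifier, and the keyword table here is entirely hypothetical:

```python
# Hypothetical keyword table standing in for a trained classifier.
CATEGORY_KEYWORDS = {
    "tv": {"tv", "television", "qled", "oled"},
    "phone": {"iphone", "smartphone", "phone"},
    "laptop": {"macbook", "laptop", "notebook"},
}

def predict_category(title):
    """Assign the category with the most keyword hits, or None."""
    tokens = set(title.lower().split())
    best, hits = None, 0
    for cat, keywords in CATEGORY_KEYWORDS.items():
        n = len(tokens & keywords)
        if n > hits:
            best, hits = cat, n
    return best

def same_category(title_a, title_b):
    """Gate: only pass pairs to the detailed matcher when both
    products land in the same (known) category."""
    ca, cb = predict_category(title_a), predict_category(title_b)
    return ca is not None and ca == cb
```

With this gate in front, "Apple iPhone 13" and "Apple MacBook 13-inch" never reach the detailed matcher, regardless of their shared brand and the shared "13" token.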
Several areas remain for future work:
WANT TO DISCUSS IR & RETRIEVAL SYSTEMS?
I'm happy to dive deeper into hybrid retrieval architectures, BM25 optimization techniques, or embedding-based search systems.
Let's Talk